research papers
Detection and correction of underassigned rotational symmetry prior to structure deposition
^{a}Physical Biosciences Division, Lawrence Berkeley National Laboratory, One Cyclotron Road, Berkeley, CA 94720, USA
^{*}Correspondence email: nksauter@lbl.gov
Up to 2% of X-ray structures in the Protein Data Bank (PDB) potentially fit into a higher symmetry space group. Redundant protein chains in these structures can be made compatible with exact crystallographic symmetry with minimal atomic movements that are smaller than the expected range of coordinate uncertainty. The incidence of problem cases is somewhat difficult to define precisely, as there is no clear line between underassigned symmetry, in which the subunit differences are unsupported by the data, and pseudosymmetry, in which the subunit differences rest on small but significant intensity differences in the diffraction pattern. To help catch symmetry-assignment problems in the future, it is useful to add a validation step that operates on the refined coordinates just prior to structure deposition. If redundant symmetry-related chains can be removed at this stage, the resulting model (in a higher symmetry space group) can readily serve as a starting point for re-refinement using re-indexed and re-integrated raw data. These ideas are implemented in new software tools available at https://cci.lbl.gov/labelit.
Keywords: underassigned rotational symmetry; LABELIT; validation.
1. Introduction
The accuracy of the molecular model derived from X-ray crystallography is inherently limited by measurement uncertainty in the structure factors and intrinsic disorder of the crystal. Indeed, atomic level accuracy is only possible if the data-set resolution approaches or exceeds 1.0 Å (see Afonine et al., 2007, and references therein). At lower resolutions, prior assumptions about the stereochemistry are required in order to sufficiently restrain the refinement process (Hendrickson, 1985). Likewise, restraints arising from noncrystallographic symmetry (NCS) averaging are important for shaping the molecular envelope and producing interpretable electron-density maps (Jones & Liljas, 1984). However, in view of the probabilistic nature of these restraints it is best to exploit true constraints such as space-group symmetry when they are available. Symmetry constraints have two benefits: merging of the symmetry-equivalent reflections increases the accuracy of the measured structure factors, and modeling the asymmetric unit rather than the entire unit cell markedly decreases the number of parameters in the molecular model. A failure to identify the highest space-group symmetry compatible with the observations can have severe consequences for model building (Kleywegt et al., 1996), leading to unwarranted conclusions about the biology of the system under study.
This paper deals with the issue of finding potentially higher space-group symmetry given a particular data set and model. While the choice of space group is a routine aspect of structure solution, it is worth keeping in mind that experimental measurements never establish the space group with absolute confidence. There are always physical uncertainties to be considered both in the positions and the intensities of the Bragg reflections. Uncertainties in Bragg spot position affect the first step of space-group assignment, in which the crystal is classified into one of 14 Bravais types (based on the metric symmetry of the unit-cell dimensions). Starting with the three lengths and three angles of the unit cell, a convenient way to evaluate a potential symmetry axis is to compute the δ angle between the axis vectors expressed in direct and reciprocal space (Le Page, 1982). If the δ angle is identically zero the axis qualifies as a rotational symmetry operator as far as the unit-cell measurements are concerned. However, practical experience with typical rotation photography experiments shows that an allowance must be made for deviations as high as 1.4° from perfect alignment in order to construct the highest symmetry Bravais type consistent with the data (Sauter et al., 2004, 2006).
Beyond the classification of Bravais type, measurement uncertainties in the Bragg intensities can potentially hinder the assignment of the diffraction symmetry. Here again it is possible to evaluate individual symmetry operators based on the agreement of symmetry-related intensity measurements. (Friedel mates are treated as equivalent throughout this paper, regardless of whether there is an anomalous signal.) Defining the symmetry-operator reliability R_{symop} as the average percentage difference between pairs of symmetry-related intensity measurements (equation 2 in Sauter et al., 2006), this statistic is ideally zero for a valid symmetry operator. However, nonzero values of up to 25% must be permitted (to account for poor measurement and/or anomalous signals) in order to assemble an optimal set of operators to describe the diffraction symmetry (Sauter et al., 2006; Evans, 2006).
It would be desirable if the acceptable tolerances chosen for δ and R_{symop} could always be large enough to reflect the physical uncertainties for the specific experiment, but there is no established method to make this guarantee. Either by intention or by mistake, structures can be solved in space groups with symmetries that are too low. Indeed, from time to time it has been remarked (Hooft et al., 1994, 1996; Zwart et al., 2008) that certain structures deposited in the PDB (Berman et al., 2003) appear to have redundant subunit chains that are related by unassigned rotational symmetry operators. Furthermore, we have observed that some commonly used methods to determine the reduced cell are susceptible to numerical instability (Grosse-Kunstleve et al., 2004; Sauter et al., 2004), making it possible for high-symmetry Bravais types to be improperly identified, such as hexagonal rhombohedral (hR) being assigned as C-centered monoclinic (mC).
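The δ test described above can be illustrated with a short sketch. It assumes that a candidate axis is specified by direct-space indices [uvw] and reciprocal-space indices (hkl), and that δ is the angle between the two corresponding vectors, computed from the direct and reciprocal metric tensors. The function names and the example cell are hypothetical and are not taken from LABELIT or cctbx.

```python
import math

def metric_tensor(a, b, c, alpha, beta, gamma):
    """Direct-space metric tensor for cell lengths (Å) and angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return [[a * a, a * b * cg, a * c * cb],
            [a * b * cg, b * b, b * c * ca],
            [a * c * cb, b * c * ca, c * c]]

def inverse3(m):
    """Inverse of a 3x3 matrix by cofactors; applied to the direct metric
    tensor it yields the reciprocal metric tensor."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[x / det for x in row] for row in adj]

def quadratic_form(v, m):
    return sum(v[i] * m[i][j] * v[j] for i in range(3) for j in range(3))

def le_page_delta(cell, uvw, hkl):
    """Angle (degrees) between the direct vector [uvw] and the reciprocal
    vector (hkl). Their dot product is simply u*h + v*k + w*l."""
    g = metric_tensor(*cell)
    g_star = inverse3(g)
    t_len = math.sqrt(quadratic_form(uvw, g))
    tau_len = math.sqrt(quadratic_form(hkl, g_star))
    dot = abs(sum(u * h for u, h in zip(uvw, hkl)))
    cos_delta = max(-1.0, min(1.0, dot / (t_len * tau_len)))
    return math.degrees(math.acos(cos_delta))

# Hypothetical cell: nearly tetragonal, but with gamma off by one degree.
distorted = (50.0, 50.0, 80.0, 90.0, 90.0, 91.0)
delta_a = le_page_delta(distorted, (1, 0, 0), (1, 0, 0))
delta_c = le_page_delta(distorted, (0, 0, 1), (0, 0, 1))
TOLERANCE = 1.4  # degrees, the allowance quoted in the text
accepted = delta_a <= TOLERANCE
```

In this example the candidate axis along a deviates by one degree, which would still be accepted under the 1.4° allowance, while the axis along c is exact.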
For small-molecule crystal structures, cases requiring reassignment into a higher symmetry space group have been well documented (Marsh & Herbstein, 1988; Marsh, 1995, 1997, 2009; Marsh & Spek, 2001) and symmetry-validation software is available (Le Page, 1988; Palatinus & van der Lee, 2008; Spek, 2009). Here, we perform a similar function for the macromolecular field, surveying the entire PDB for underassigned rotational symmetry operators. [We address neither underassigned translational symmetry operators, as was performed recently by Zwart et al. (2005, 2008), nor the topic of twinning, as has been covered by Lebedev et al. (2006).] Since we do not usually have recourse to the original raw data images, no judgements are made about the true crystallographic symmetry in individual cases. Rather, we develop scoring tools to quantify how closely a particular atomic model appears to fit into a higher symmetry, and coordinate-manipulation tools to interconvert models between space groups. The tools are intended to be used by the original investigator for validating the model at any stage prior to structure deposition or for correcting a model that is deemed suitable for reanalysis in a higher symmetry.
2. Computational methods
Software development was greatly facilitated by the framework provided by the open-source Computational Crystallography Toolbox (cctbx; Grosse-Kunstleve et al., 2002, 2006). PDB coordinate files from https://wwpdb.org were parsed with the cctbx.iotbx.pdb file reader. Analysis was restricted to coordinate sets determined by X-ray crystallography and additionally to proteins rather than nucleic acids. Solvent molecules, ligands, covalent modifications and alternate conformations were ignored. Structure factors from the PDB, when available, were validated with phenix.cif_as_mtz (Urzhumtseva et al., 2009) to assure consistency with the corresponding PDB coordinate entry. Raw diffraction images for selected cases were downloaded from the Joint Center for Structural Genomics (JCSG; https://www.jcsg.org).
2.1. Automated structure solution in all possible subgroups
Before proceeding with the all-PDB survey, we wish to confirm that the true symmetry can be deduced from the atomic model if the structure is intentionally solved in a lower symmetry space group. Such structures were generated automatically using original JCSG data sets as a starting point and are illustrated here using PDB entry 3b77 (Table S2^{1} gives further examples). After integrating the 3b77 data set in the triclinic setting, merging trials performed with labelit.rsymop (Sauter et al., 2006) show that the Bragg intensities, together with the unit-cell dimensions, are consistent with Patterson symmetries P4/m, P12/m1 or P-1. To obtain structure solutions in all three possible symmetries, the data were re-integrated, scaled and merged separately in each of these settings. Molecular-replacement solutions were determined with the program phenix.automr (McCoy et al., 2007) using the published P4 structure as a replacement model. Solutions A1, A2 and A3 (corresponding to the three symmetries noted above) were then built and refined with phenix.autobuild (Terwilliger et al., 2008). As this particular data set consists of a 90° rotation wedge intended for a tetragonal structure, the completeness of the data is quite low (57% out to a limiting resolution of 3.5 Å) when processed in the triclinic setting; however, this is still sufficient for the present purpose. To afford a comparison between crystallographic R factors (Table 1), each structure is refined at the same resolution, and the same set of free-R flags as initially calculated for the highest symmetry (P4) is expanded into the monoclinic and triclinic settings.

2.2. Relating the input symmetry to potential higher symmetries
In principle, it should be straightforward to check whether an atomic model can be reassigned to a higher symmetry target space group G. One simply lists the symmetry operators of the target and selects the operators that are absent in the input space group H. Applying these trial operators to the input structure will leave both atomic coordinates and structure-factor intensities invariant if the target symmetry is valid.
In practice this calculation is fairly complicated since space groups are conventionally expressed in different reference frames (Hahn, 1996). In the general case, the input and target symmetries will have different unit-cell basis vectors a, b, c and choices of origin. To assure that H is a subgroup of G a single point of view must be chosen, and the approach taken here is to perform all comparisons in the reference frame of the target symmetry. Converting from the input to the target reference frame requires the sequence of transformations depicted in Fig. 1. Beginning with the initial setting, a change of basis (Boisen & Gibbs, 1990) is applied to remove any centering operations (Grosse-Kunstleve, 1999). The resulting primitive setting is then changed to a standard reduced setting (the `minimum' setting defined in Grosse-Kunstleve et al., 2004). To afford comparisons between Bragg reflections that are potentially symmetry-equivalent, we enumerate all Patterson settings that align with the reduced cell to within a tolerance δ (Sauter et al., 2006) and change the basis to each of these metric group settings in turn. Having selected one of these metric settings (see §2.3), we then need to evaluate all of the candidate space groups that share the same Patterson symmetry as the metric group, each requiring a basis change from the reduced setting to the candidate setting. At this point, a fractional translation (see §2.4) must be applied so that duplicate polypeptide chains are correctly related by the candidate space group's rotational symmetry operators. A final adjustment to the conventional setting is necessary in certain cases, particularly those orthorhombic cases in which the target symmetry is in a nonstandard setting such as P2_{1}22, which must be converted to the standard setting (P222_{1}) by an axis swap.
Each transformation in this sequence is represented by a change-of-basis operator, which combines a rotation matrix and a translation vector, each containing rational-valued elements. These operators are mathematically associative, so that the total transformation from input to target setting is succinctly expressed as a single rotation R and translation T. As detailed elsewhere (Giacovazzo et al., 1992; Sauter et al., 2006), the transformation (R, T) can be applied to fractional coordinates and symmetry operations from the input structure in order to re-express them in the target reference frame. It is important to realise that the entire input structure is moved as a rigid body under the operation (R, T), so the symmetry properties of the structure do not change during the transformation. It is just a matter of convenience to move the structure into the same reference frame where we already have a list of the trial symmetry operators of the target space group.
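The way a chain of change-of-basis operators collapses into a single (R, T) can be sketched with exact rational arithmetic. The sketch assumes that an operator acts on fractional coordinates as x′ = Rx + T; the three operators below are arbitrary illustrations rather than the actual transformations of Fig. 1, and all names are hypothetical.

```python
from fractions import Fraction as Fr

def op(rows, shift):
    """Build an operator (R, T) with exact rational elements."""
    return ([[Fr(v) for v in row] for row in rows], [Fr(v) for v in shift])

def compose(second, first):
    """Operator equivalent to applying `first`, then `second`:
    x' = R2 (R1 x + T1) + T2 = (R2 R1) x + (R2 T1 + T2)."""
    r2, t2 = second
    r1, t1 = first
    r = [[sum(r2[i][k] * r1[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    t = [sum(r2[i][k] * t1[k] for k in range(3)) + t2[i] for i in range(3)]
    return r, t

def apply(operator, x):
    r, t = operator
    return [sum(r[i][k] * x[k] for k in range(3)) + t[i] for i in range(3)]

# Three illustrative steps: an axis swap, an origin shift and a cell change.
step_a = op([[0, 1, 0], [1, 0, 0], [0, 0, -1]], [0, 0, 0])
step_b = op([[1, 0, 0], [0, 1, 0], [0, 0, 1]], [0, 0, Fr(1, 2)])
step_c = op([[1, 1, 0], [-1, 1, 0], [0, 0, 1]], [Fr(1, 4), 0, 0])

# Associativity: the grouping of the steps does not matter.
total = compose(step_c, compose(step_b, step_a))
regrouped = compose(compose(step_c, step_b), step_a)

x = [Fr(1, 10), Fr(1, 5), Fr(3, 10)]
stepwise = apply(step_c, apply(step_b, apply(step_a, x)))
at_once = apply(total, x)
```

Using `Fraction` rather than floating point keeps the combined operator exact, which matters when operators are later compared for equality.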
2.3. Evaluation of the Patterson symmetry
We anticipated that the models from §2.1 solved in suboptimal space groups (A2 and A3) would have poorer crystallographic R factors than the optimal model A1. Instead, we found that the R-factor statistic did not help at all to distinguish between the best symmetry and the underassigned symmetry. The implication is one of caution: if the optimal Patterson symmetry is passed over at the stage of indexing and integration then the model-building and refinement process may be completed successfully without any indication of the oversight.
Fortunately, the model itself can be examined (following the approach of §2.2) to assess its compatibility with higher symmetry. A first step (Tables 2 and 3) is to establish missing symmetry operators based on back-calculated reflection intensities, I^{calc}. After expanding the atomic coordinate model to P1, the unit-cell measurements are used to construct the largest possible set of lattice symmetry operators, as described previously (Sauter et al., 2006). Each potential operator W is then independently scored based on the agreement of symmetry-related intensities,
R_{symop} = \frac{\sum_{pairs} \sum_{j} |I_{j} - \langle I \rangle|}{\sum_{pairs} \sum_{j} I_{j}}, \quad (1)
where \sum_{pairs} is a sum over all pairs of Bragg spots related by W, \sum_{j} is a sum over both members of the pair and \langle I \rangle is the mean intensity of the pair. Low R_{symop} values indicate valid rotational symmetry in reciprocal space, and in the illustrated example it is apparent that there is a fourfold rotation along the z axis (Table 2). The fourfold is equally clear regardless of whether the model is taken from the monoclinic or the triclinic structure. The triclinic structure (A3) additionally reveals a twofold symmetry along the z axis, while the monoclinic model (A2) already assumes the presence of this twofold, so the R_{symop} value for this operator is zero.
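A minimal sketch of this scoring step is given below. It assumes the usual pairwise form of the statistic (summed absolute deviations from the pair mean, divided by the summed intensities) and represents the operator W as an integer matrix acting on the Miller index as a row vector. Friedel merging is omitted, and the function name and toy reflection lists are hypothetical.

```python
def r_symop(intensities, w):
    """Pairwise agreement of intensities related by the rotation `w`.

    intensities: dict mapping (h, k, l) to an intensity.
    w: 3x3 integer matrix; the symmetry mate of H is the row vector H*w.
    """
    numerator = 0.0
    denominator = 0.0
    seen = set()
    for hkl, i_first in intensities.items():
        mate = tuple(sum(hkl[i] * w[i][j] for i in range(3)) for j in range(3))
        if mate == hkl or mate not in intensities:
            continue  # reflection on the axis, or mate not present
        pair = frozenset((hkl, mate))
        if pair in seen:
            continue  # count each pair once
        seen.add(pair)
        i_second = intensities[mate]
        mean = 0.5 * (i_first + i_second)
        numerator += abs(i_first - mean) + abs(i_second - mean)
        denominator += i_first + i_second
    return numerator / denominator if denominator else 0.0

# Fourfold rotation about z, written for row-vector Miller indices:
# (h, k, l) -> (k, -h, l).
FOURFOLD_Z = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]

symmetric = {(1, 0, 0): 100.0, (0, -1, 0): 100.0,
             (-1, 0, 0): 100.0, (0, 1, 0): 100.0}
broken = {(1, 0, 0): 100.0, (0, -1, 0): 50.0}

score_symmetric = r_symop(symmetric, FOURFOLD_Z)
score_broken = r_symop(broken, FOURFOLD_Z)
```

Multiplying the returned fraction by 100 gives a percentage that can be compared with the 25% allowance mentioned in the Introduction.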


In Table 3 the lattice symmetry operators are grouped together to show all possible Patterson settings consistent with the unit cell (to within the small angular tolerance δ). Each setting is scored by tabulating the worst-case symmetry-equivalence measure (R_{symop}), considering all operators in the group. As expected, the illustrated example (triclinic structure A3) is consistent with only three of the metrically possible Patterson settings, namely P4/m, P12/m1 and P-1, and not with any groups containing a twofold in the xy plane.
We arrive at the same conclusions about symmetry if we use the experimentally observed data (Tables 2 and 3) rather than model-calculated intensities. Starting with merged structure-factor amplitudes F^{obs}, the observations are expanded to P1, re-expressed as reflection intensities (I^{obs}) and used in (1) instead of I^{calc}. This methodology is readily used to evaluate the potential Patterson settings in any deposited reflection file from the PDB.
2.4. Identification of the space group and positioning of the model
Symmetry-equivalence of the reflections (1), together with knowledge of the unit cell, establishes the highest possible Patterson symmetry, but two questions remain to be answered: what is the space group and where should the model be placed in the higher symmetry unit cell? Taking the example of structure A3, we wish to know which of the tetragonal space groups to focus on (P4, P4_{1}, P4_{2} or P4_{3}) and where to place the polypeptide in relation to the z axis.
We begin by defining x, the fractional origin shift that must be applied in the setting of the target space group G to the input model in order to properly position it within the higher symmetry unit cell (denoted as `translation x' in Fig. 1). The model is correctly positioned when the application of space-group symmetry operators leaves the model invariant. In view of the prohibitive computational cost of translating the model to every position in the unit cell, we adopt a method from Navaza & Vernoslova (1995), dramatically speeding up the calculation by gauging the correlation between two types of calculated Bragg intensity: I^{merge,G} and I^{ensemble,G}(x). I^{merge,G} is simply the set of reflection intensities calculated by expanding the atomic coordinates of the present model into P1 and merging the symmetry equivalents under G. I^{ensemble,G}(x) is the result of applying the origin shift x, thus repositioning the model in the unit cell. The symmetry elements of G are then applied, giving a hypothetical ensemble containing multiple copies of the P1 model superimposed upon each other (one copy for each symmetry operator) from which intensities I^{ensemble,G}(x) are calculated. The agreement between present model, origin shift and space group is described by the Pearson correlation coefficient
r(x, G) = \frac{\langle \Delta I_{H}^{merge,G} \, \Delta I_{H}^{ensemble,G}(x) \rangle}{[\langle (\Delta I_{H}^{merge,G})^{2} \rangle \langle (\Delta I_{H}^{ensemble,G}(x))^{2} \rangle]^{1/2}}, \quad (2)
where 〈 〉 is the average over all H and ΔI_{H} = I_{H} − 〈I〉 is the deviation between the calculated intensity for a given Miller index and the average over all intensities. Navaza and Vernoslova's fast Fourier approach for calculating r(x, G) is computationally tractable even for large structures.
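For a single trial origin shift, the correlation coefficient reduces to an ordinary Pearson correlation between two intensity lists indexed by the same reflections. The sketch below shows only that direct evaluation; it does not reproduce the fast Fourier evaluation over all shifts x attributed to Navaza & Vernoslova (1995), and the intensity values are invented for illustration.

```python
import math

def pearson(i_merge, i_ensemble):
    """Pearson correlation between two equally indexed intensity lists."""
    n = len(i_merge)
    mean_merge = sum(i_merge) / n
    mean_ensemble = sum(i_ensemble) / n
    # Deviations from the mean, i.e. the Delta-I terms of the text.
    d_merge = [v - mean_merge for v in i_merge]
    d_ensemble = [v - mean_ensemble for v in i_ensemble]
    covariance = sum(a * b for a, b in zip(d_merge, d_ensemble)) / n
    var_merge = sum(a * a for a in d_merge) / n
    var_ensemble = sum(b * b for b in d_ensemble) / n
    return covariance / math.sqrt(var_merge * var_ensemble)

# Invented intensities: the first trial agrees up to scale, the second does not.
i_merge = [120.0, 45.0, 300.0, 80.0, 10.0]
well_placed = [240.0, 90.0, 600.0, 160.0, 20.0]
misplaced = [50.0, 310.0, 20.0, 95.0, 150.0]

r_good = pearson(i_merge, well_placed)
r_bad = pearson(i_merge, misplaced)
```

Because the coefficient is insensitive to an overall scale factor, the two intensity sets do not need to be placed on a common scale before comparison.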
Peaks in the r(x, G) map that approach a value of 1.0 represent candidate translations for positioning the model into the target space group. In the illustrated example (Figs. 2a–2d), the relatively low correlation coefficients under P4_{1} and P4_{3} allow us to rule out these space groups, while space groups P4 and P4_{2} are both shown to be viable candidates as far as intensity correlations are concerned. When viewing these correlation maps it is useful to realise that the r(x, G) function has a special type of symmetry variously called the Cheshire group (Hirshfeld, 1968) or affine normalizer (Koch & Fischer, 1996); the effect of this is to restrict the range of possible origin shifts to an area or volume smaller than the unit cell of G. For the four tetragonal space groups under consideration r(x, G) is independent of the position along the fourfold, so it is only necessary to illustrate a single section in Figs. 2(a)–2(d).
The fast translation function of (2) is very efficient for discriminating among origin shifts, but in this case it does not distinguish between the two candidate models that might be consistent with structure A3: a P4 model with origin shift x_{max} = 0 (Fig. 2e) and a P4_{2} model shifted by x_{max} = ½c (Fig. 2f). The latter model happens to be incorrect in the sense that application of the 4_{2} screw leads to an atomic model (red circles in Fig. 2f) that sterically clashes with the starting model (blue circles) rather than aligning with it; each polypeptide is effectively duplicated. Yet the calculated intensities for the two sets of asymmetric units are identical since intensities are invariant under the screw-axis operator. What is missing in (2) is a recognition that the screw operation affects the structure-factor phase, even though it does not affect the amplitude.
Properly accounting for phases requires a separate calculation. We take the input model (triclinic structure A3 in this case), apply the origin shift x_{max} determined above, and then consider the calculated structure factors F^{calc} and phases φ^{calc}. Looking separately at each symmetry operator g_{i} of G, a weighted phase difference factor is used to construct a symmetry agreement score φ(g_{i}), equation (3), as suggested by Palatinus & van der Lee (2008).
In this expression, symmetry operator g_{i} has a rotational part W and a translational part w. The normalization constant C and modular integer n are as described in Palatinus & van der Lee (2008). Models that are invariant under the symmetry operator will have equal values of φ_{H}^{calc} and φ_{HW}^{calc} + 2πH·w, so the score will be zero. In our example, the symmetry agreement scores φ(4) = 0.002 and φ(4_{2}) = 0.578 clearly establish the correct space group as P4.
2.5. Positional refinement of the higher symmetry model
The Pearson correlation coefficient (2) is evaluated on a grid whose granularity is approximately half the limiting resolution of the diffraction. Therefore, the origin shift x_{max} from §2.4 is only a first approximation. Indeed, the displacement between the atomic model and the symmetry axes of the space group should arguably be the most precise element of any structure. Since the displacement is derived jointly from the positions of all the atoms, its uncertainty should be a tiny fraction of a bond length. It is thus appropriate to subject the origin shift to additional refinement. Furthermore, while (3) scores the symmetry agreement of structure factors in reciprocal space, it is also desirable to quantify the symmetry based on the atomic model in real space (or even to provide a computer-graphics snapshot of superimposed symmetry-equivalent molecules), giving a better intuitive grasp of the symmetry fit. This section presents methods for addressing these issues.
2.5.1. Matching of symmetry-equivalent molecules aided by coset decomposition
As noted in §2.2, we judge a target space group G by applying symmetry operators present in G that are absent in the input space group H. The relationship between group G and its subgroup H can be most usefully explored by the coset decomposition tools of group theory. In particular, the left coset decomposition of G with respect to H is defined as
G = g_{1}H \cup g_{2}H \cup \ldots \cup g_{n}H. \quad (4)
In this expansion, G is broken down into a series of n subsets (left cosets) generated by applying the symmetry operators g_{i} ∈ G to each element of H. Operator g_{1} is defined to be the identity, while the elements g_{2}…g_{n}, termed left coset representatives, are the elements that require evaluation as trial symmetry operators. The choice of which elements to count as left coset representatives is not unique; within each left coset any element can be chosen as the representative with equivalent results. The important property here is that only one representative from each coset need be considered.
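The decomposition can be sketched for the rotational parts alone. The example below assumes that operators are represented as 3×3 integer rotation matrices (translations are ignored, so this is a point-group illustration) and enumerates the left cosets of point group 4 with respect to its subgroup 2. The names are hypothetical and do not correspond to cctbx functions.

```python
def matmul(a, b):
    """Product of two 3x3 integer matrices stored as nested tuples."""
    return tuple(tuple(sum(a[i][k] * b[k][j] for k in range(3))
                       for j in range(3)) for i in range(3))

IDENTITY = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
FOUR_PLUS = ((0, -1, 0), (1, 0, 0), (0, 0, 1))   # x,y,z -> -y,x,z
TWO = matmul(FOUR_PLUS, FOUR_PLUS)               # x,y,z -> -x,-y,z
FOUR_MINUS = matmul(TWO, FOUR_PLUS)              # x,y,z -> y,-x,z

def left_cosets(group, subgroup):
    """Return a list of (representative, coset) pairs, where each coset is
    the set {g*h for h in subgroup}. The first element of `group` that is
    not yet covered is taken as the representative of a new coset."""
    cosets = []
    covered = set()
    for g in group:
        if g in covered:
            continue
        coset = frozenset(matmul(g, h) for h in subgroup)
        cosets.append((g, coset))
        covered |= coset
    return cosets

point_group_4 = [IDENTITY, FOUR_PLUS, TWO, FOUR_MINUS]
point_group_2 = [IDENTITY, TWO]
cosets = left_cosets(point_group_4, point_group_2)
representatives = [rep for rep, _ in cosets]
```

Listing the group in a different order would select a different, equally valid, representative for the second coset (4^{−} instead of 4^{+}).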
The coset expansion makes it possible to quantify how close the noncrystallographic symmetry relationships of a structure come to crystallographic exactness. A necessary first step is to derive trial mappings of the contents of the structure onto itself, one mapping for each coset. The algorithm begins by origin-shifting the input structure to the optimized setting (Fig. 1). Matching polypeptide pairs (X to Y) are then determined for each coset representative g_{i} using a triple loop. In the outer loop, g_{i} is applied to each polypeptide chain X of the asymmetric unit. In the middle loop, each polypeptide chain Y is evaluated as a matching target (with the requirement that Y is only considered as a candidate if X and Y have similar amino-acid sequences). In the innermost loop, each operator h ∈ H is applied to Y and a match is declared if the coordinates approximately superimpose,
g_{i}X \approx hY + t. \quad (5)
In this expression, the atomic coordinates of X and Y are expressed in fractional coordinates and t represents an allowable translation vector on the lattice (one containing full-integer components). Superposition is determined using the method of Kearsley (1989) and calculations throughout this paper are limited to the C^{α} atoms of polypeptide chains.
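The triple loop can be sketched for a single coset representative as follows. Several simplifications are assumptions of this sketch rather than features of the published method: the Kearsley (1989) superposition is replaced by a direct per-atom comparison in fractional coordinates, sequence similarity is reduced to exact equality, and the lattice translation t is estimated by rounding the offset of the first atom. The two-atom chains are invented.

```python
def apply_op(operator, xyz):
    """Apply (R, t) to a fractional coordinate: x' = R x + t."""
    r, t = operator
    return tuple(sum(r[i][k] * xyz[k] for k in range(3)) + t[i]
                 for i in range(3))

def match_chains(chains, sequences, g, subgroup_ops, tol=0.01):
    """For one representative g, find for each chain X a chain Y, an operator
    index in H and an integer lattice translation t such that g*X coincides
    with h*Y + t to within `tol` (fractional units)."""
    matches = {}
    for x_name, x in chains.items():                    # outer loop: chain X
        gx = [apply_op(g, p) for p in x]
        for y_name, y in chains.items():                # middle loop: chain Y
            if sequences[x_name] != sequences[y_name] or len(x) != len(y):
                continue
            for h_index, h in enumerate(subgroup_ops):  # inner loop: h in H
                hy = [apply_op(h, p) for p in y]
                t = tuple(round(a - b) for a, b in zip(gx[0], hy[0]))
                if all(abs(a - b - s) < tol
                       for p, q in zip(gx, hy)
                       for a, b, s in zip(p, q, t)):
                    matches[x_name] = (y_name, h_index, t)
    return matches

ZERO = (0, 0, 0)
IDENTITY = (((1, 0, 0), (0, 1, 0), (0, 0, 1)), ZERO)
TWOFOLD = (((-1, 0, 0), (0, -1, 0), (0, 0, 1)), ZERO)
FOUR_PLUS = (((0, -1, 0), (1, 0, 0), (0, 0, 1)), ZERO)

# Chain G is chain A rotated by the fourfold, moved back into the cell and
# perturbed slightly, mimicking a near-exact NCS relationship.
chains = {
    "A": [(0.10, 0.20, 0.30), (0.15, 0.25, 0.35)],
    "G": [(0.801, 0.100, 0.300), (0.750, 0.151, 0.350)],
}
sequences = {"A": "MK", "G": "MK"}
matches = match_chains(chains, sequences, FOUR_PLUS, [IDENTITY, TWOFOLD])
```

In this toy case chain A maps to chain G through the identity of H, while chain G maps back to chain A through the twofold, echoing the pattern described for Fig. 3.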
A simple example of chain matching is illustrated in Fig. 3. There are 12 identical polypeptide chains in the asymmetric unit of the monoclinic structure A2. Adapting the input space group (H = P2) into the target space group (G = P4) leads to the coset decomposition
P4 = \{1, 2\} \cup 4^{+}\{1, 2\}, \quad (6)
where the numerical symbols are intended to represent the identity operator 1, the twofold rotation 2 of P2 and the fourfold g_{2} = 4^{+} chosen as the single left coset representative. Under the operation of g_{2}, polypeptide chains A–F map to chains G–L, while chains G–L map to chains A′–F′ in the second asymmetric unit of the monoclinic cell (corresponding to h = 2).
2.5.2. High-precision refinement of the origin shift
In the preceding section, the approximately known origin shift x is used to discover symmetry-matched peptide pairs. We now turn this process around, performing least-squares refinement on these known matches to produce the best possible chain alignment, while considering x to be a free variable. For these purposes we revert the atomic coordinates back to the candidate group setting (Fig. 1) prior to the application of the origin shift. The function to be minimized, f of equation (7), is the Cartesian square difference between chain-matched C^{α} positions.
The outer summation here is over all n cosets except for the first one, which just produces the identity mapping. The middle sum is over all P polypeptide chains in the asymmetric unit and the inner sum is over the N_{α} C^{α} pairs in the jth matching pair of chains (X, Y). Operator g_{i} is the ith coset representative, while t_{XY} and h_{XY} are the translational and rotational symmetry operators in H required to produce a match between chains X and Y (5). Matrix O is the orthogonalization matrix required to convert fractional to Cartesian coordinates. After minimization of the function f, the refined origin shift is used to recalculate the optimized structure (Fig. 1).
Having determined the final origin shift, the input structure's fit with target G can now be evaluated. If the structure is perfectly invariant when the coset representative operators are applied, the value of the function f will be identically zero. The deviation from perfect symmetry can be expressed as the root-mean-squared deviation of C^{α} atoms from their symmetry-predicted positions,
\Delta r_{sym} = (f / \Sigma N)^{1/2}, \quad (8)
where ΣN symbolizes the total count of C^{α} matches over all matching polypeptide pairs and all cosets in the triple sum of (7).
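The root-mean-squared deviation can be illustrated numerically. The sketch assumes that f is an unweighted sum of squared Cartesian distances between each C^{α} atom and its symmetry-predicted position, and it uses an orthogonal cell so that the orthogonalization matrix O is diagonal. The coordinates and the helper names are hypothetical.

```python
import math

def orthogonalize(o, frac):
    """Convert a fractional vector to Cartesian coordinates (Å) with matrix O."""
    return [sum(o[i][k] * frac[k] for k in range(3)) for i in range(3)]

def delta_r_sym(o, predicted, observed):
    """RMS Cartesian distance between symmetry-predicted and observed
    positions, both given as lists of fractional coordinates."""
    f = 0.0
    count = 0
    for p, q in zip(predicted, observed):
        d = orthogonalize(o, [a - b for a, b in zip(p, q)])
        f += sum(c * c for c in d)   # squared Cartesian distance
        count += 1                   # one C-alpha match
    return math.sqrt(f / count)

# Orthogonal 50 x 50 x 80 Å cell: O is diagonal.
O = [[50.0, 0.0, 0.0], [0.0, 50.0, 0.0], [0.0, 0.0, 80.0]]

predicted = [(0.100, 0.200, 0.300), (0.400, 0.500, 0.600)]
observed = [(0.102, 0.200, 0.300),       # displaced 0.1 Å along x
            (0.400, 0.500, 0.60125)]     # displaced 0.1 Å along z

rmsd = delta_r_sym(O, predicted, observed)
```

Both atoms are displaced by 0.1 Å, so the statistic evaluates to 0.1 Å, the order of magnitude reported for structures A2 and A3.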
2.5.3. Generating coordinate sets corresponding to each asymmetric unit
Imposing additional symmetry on a structure implies that the number of unique polymer chains will be reduced; in fact, the resulting asymmetric unit will contain exactly P/n chains, the original number of chains divided by the number of cosets. The chain-matching results of §2.5.1 can be used to construct approximate models of the higher symmetry asymmetric unit. The key idea is to select one chain from each group of mutual chain matches; e.g. in Fig. 3 one chain is selected from each of the six groups {A, G}, {B, H}, {C, I}, {D, J}, {E, K} and {F, L}. While there are many possible combinations [n^{(P/n)}], we take the simple expedient of selecting the polypeptide from each group that appears first in the original PDB input file, so in this case chains A–F are selected as the primary model of the asymmetric unit. To visualize the extent to which the input structure differs from the perfect symmetry of G, n − 1 additional models are then generated, one for each remaining coset. These arise by looping over the chains X of the primary model and transforming their matched chains Y according to equation (9), thus placing the matching chains and the primary model in approximate alignment. The end product of this exercise is a set of n different models of the higher symmetry asymmetric unit, nearly superimposed, with differences among models reflecting the NCS variability of the input structure. These models can be readily output as PDB-format files for visual inspection and further analysis. In the example of Fig. 3, the two models consist of chains A–F and G–L, respectively.
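The selection rule for the primary model, and the count of alternatives it avoids enumerating, can be sketched with the chain groups of Fig. 3. The function names are hypothetical; the chain labels and group memberships are those given in the text.

```python
def primary_model(match_groups, file_order):
    """From each group of mutually matched chains, keep the chain that
    appears first in the original coordinate file."""
    rank = {name: index for index, name in enumerate(file_order)}
    return [min(group, key=rank.__getitem__) for group in match_groups]

def count_combinations(n_cosets, n_chains):
    """Number of ways to pick one chain per group: n ** (P / n)."""
    return n_cosets ** (n_chains // n_cosets)

# Chain groups from the example of Fig. 3 (P = 12 chains, n = 2 cosets).
groups = [{"A", "G"}, {"B", "H"}, {"C", "I"},
          {"D", "J"}, {"E", "K"}, {"F", "L"}]
file_order = "ABCDEFGHIJKL"

primary = primary_model(groups, file_order)
secondary = [next(iter(group - {chain}))
             for group, chain in zip(groups, primary)]
alternatives = count_combinations(2, 12)
```

For this example there are 64 possible combinations, of which the first-in-file rule selects chains A–F, leaving chains G–L for the second model.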
2.5.4. Interpreting the deviation from perfect symmetry
Differences among these asymmetric unit (ASU) models combine two types of variation: a rigid-body component describing the motion of the ASU contents as a whole and a residual component reflecting the positions of individual atoms. The Δr_{sym} measure of (8) contains both components, but it is also informative to separate the rigid-body and residual terms. To evaluate the residual component by itself, we perform a Kearsley (1989) alignment of the entire C^{α} contents of ASU models i and j, and evaluate the root-mean-squared deviation of superimposed atoms, Δr_{ij}. Averaging this quantity over all \binom{n}{2} pairwise combinations of ASU models, the overall residual component can be expressed as Δr_{ASU}, equation (10), where N_{ij} is the total number of C^{α} matches between ASU models i and j. For cases where the ASU model contains more than one polypeptide chain, an additional measure of the residual term, Δr_{chain}, is defined to represent deviations of atoms within individual chains. This quantity is calculated in an identical manner to (10) except that the Kearsley alignment is performed on individual pairs of chains and the resulting summation contains (P/n)\binom{n}{2} terms.
Values for Δr_{sym}, Δr_{ASU} and Δr_{chain} for structures A2 and A3 are reported in Table 4. The predominant contribution to the NCS differences in these structures is from random deviations of individual atoms of the order of 0.1 Å. There is only an insignificant contribution (0.002 Å in structure A2 and 0.03 Å in structure A3) from rigid-body rearrangements of polypeptide chains.

2.6. Reindexing the diffraction images in higher symmetry
We now suppose that a decision has been made to increase the symmetry of the atomic model. Clearly, the best outcome can be achieved by returning to the original diffraction images. Imposing the new space group G (P4 in the case of structures A2 and A3) on the original data will permit better unit-cell constraints for the prediction of spot positions during integration, afford more symmetry equivalents for outlier rejection during scaling and possibly remove model bias resulting from introducing too many free atoms during the model-building step.
Yet there are certain steps of the data-processing pipeline that would be wasteful to repeat. Since we already have an ensemble of models of the higher symmetry asymmetric unit, it is no longer necessary to repeat the decision during autoindexing in which the lattice and symmetry are chosen from a list of lattices compatible with the observed cell. Similarly, no phasing protocols should be required, as the structure of the atomic model and its position in the unit cell have adequately been addressed by the fast translation function (§2.4) and subsequent refinement (§2.5.2).
An express route to re-refinement is achieved by adapting the autoindexing program labelit.index (Sauter et al., 2004) to accept the additional input of a PDB file containing one of the proposed ASU models from §2.5.3. Structure factors are calculated, taking into account a bulk-solvent correction (Afonine et al., 2005) to more realistically model the observed intensities. Separately, data from one or two frames of the raw data are integrated and corrected for Lorentz and polarization factors (Leslie, 1999), using a preliminary reduced cell (Grosse-Kunstleve et al., 2004) to model the lattice. We now wish to determine how the unit-cell basis vectors of the calculated and observed patterns need to be aligned in order to obtain the best fit between intensities. Two types of ambiguity need to be resolved. Firstly, in some cases the unit cell is close to fitting into a higher symmetry metric. A triclinic cell, e.g. with dimensions a ≃ b and α ≃ β, may require an axis swap (a′, b′, c′ = −b, −a, −c) to correctly model the observed pattern. Secondly, certain space groups permit multiple nonequivalent indexing schemes (Dauter, 1999), only one of which will allow the ASU model to align properly with the observations. For example, point groups 3, 4 and 6 can be indexed with the c axis up or down. All of these ambiguities can be resolved by exhaustively testing each possible re-indexing scheme that preserves the unit-cell dimensions, and assessing the mutual scaling R factor (Weiss, 2001) between calculated and observed intensities.
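The exhaustive test of re-indexing schemes can be sketched as follows. The sketch assumes that each scheme is an integer matrix acting on the Miller index as a row vector, and it scores agreement with a plain linear R factor after a single overall scale factor, which stands in for (and is not) the mutual scaling R factor of Weiss (2001). The candidate list, reflections and names are invented.

```python
def reindex(hkl, m):
    """Re-index a Miller index (row vector) with the integer matrix m."""
    return tuple(sum(hkl[i] * m[i][j] for i in range(3)) for j in range(3))

def r_factor(calc, obs):
    """Linear R factor over common reflections after one overall scale."""
    common = [h for h in obs if h in calc]
    scale = sum(obs[h] for h in common) / sum(calc[h] for h in common)
    return (sum(abs(obs[h] - scale * calc[h]) for h in common)
            / sum(obs[h] for h in common))

def best_reindexing(calc, obs, candidates):
    """Try every candidate scheme and return the name with the lowest R."""
    scores = {}
    for name, m in candidates.items():
        reindexed = {reindex(h, m): i for h, i in obs.items()}
        scores[name] = r_factor(calc, reindexed)
    return min(scores, key=scores.get), scores

# Two schemes for point group 4: c axis up, or c axis down (k, h, -l).
candidates = {
    "h,k,l": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    "k,h,-l": [[0, 1, 0], [1, 0, 0], [0, 0, -1]],
}

calc = {(1, 2, 3): 100.0, (2, 1, -3): 10.0, (1, 0, 1): 50.0, (0, 1, -1): 5.0}
# Invented observations, indexed with the c axis pointing the other way.
obs = {(2, 1, -3): 100.0, (1, 2, 3): 10.0, (0, 1, -1): 50.0, (1, 0, 1): 5.0}

best, scores = best_reindexing(calc, obs, candidates)
```

Only the scheme that flips the c axis brings the observed intensities into agreement with the model, so it is the one selected.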
The result is an indexing solution for the diffraction pattern that correctly accounts for the position and orientation of the ASU model in G. At this point the full data set is integrated, scaled and converted to structure factors. Structure refinement is initiated (e.g. with phenix.refine) starting with the aforementioned ASU model. As shown in Table 1, the re-refinement of triclinic structure A3 in P4, without any further manual intervention, leads to a new structure (A4) that is comparable to the original published PDB file.
3. Results and discussion
A November 2009 snapshot of the PDB was analyzed to identify X-ray structures that are nearly invariant when additional rotational symmetry operators are imposed. Of almost 62 000 files in the database, about 53 000 are X-ray structures. Here, we focus on the approximately 52 000 that contain protein chains. About 1000 structures, or 2%, were conservatively found to produce a good fit with a higher symmetry space group. Fig. 4 ranks these candidates in order of increasing Δr_{sym} (a measure of the average C^{α} displacement required to impose the additional symmetry; see equation 8) up to an arbitrary cutoff value (see below) of Δr_{sym} = 0.325 Å. A full listing is given in Table S1^{1}.
It is beyond the scope of this paper to deliver a definitive choice as to which space groups are best for individual structures. However, if the conservative group of 1000 shown in Fig. 4 is considered as a whole, there are strong arguments to favor the higher symmetry settings. Foremost is the small size of the displacements needed to bring equivalent atoms into a perfectly symmetrical arrangement. It is generally recognized that the coordinate accuracy of an X-ray structure is a fraction of the diffraction pattern's limiting resolution (Luzzati, 1952). Various methods are presently used to estimate the coordinate uncertainty (Kleywegt, 2000) and, where reported in the PDB, these 1σ uncertainty values are plotted in Fig. 4(a). Most of the estimated values shown (75%) are at least as high as Δr_{sym}. Generally speaking then, for this group, there is a good chance that displacements seeming to be a product of differences are really a result of experimental coordinate uncertainty.
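To make the size of these displacements concrete, the following sketch estimates how far C^{α} atoms must move to make two chains obey an exact twofold axis. It is only an illustration: equation (8) is not reproduced in this section, so the definition used here (each atom moves half the gap between itself and the symmetry image of its mate, and the result is the mean over atoms) is an assumption, as are the function names and the choice of a twofold axis along z through the origin.

```python
import math

def twofold_z(point):
    """Apply a twofold rotation about the z axis through the origin."""
    x, y, z = point
    return (-x, -y, z)

def mean_symmetrizing_displacement(chain_a, chain_b, op):
    """Mean displacement (Å) needed to make chain_a and op(chain_b) coincide.

    chain_a and chain_b are equal-length lists of (x, y, z) C-alpha positions
    listed in matching residue order. Moving each atom to the midpoint between
    itself and the image of its mate costs half the gap per atom.
    """
    if len(chain_a) != len(chain_b) or not chain_a:
        raise ValueError("chains must be non-empty and of equal length")
    shifts = [math.dist(a, op(b)) / 2.0 for a, b in zip(chain_a, chain_b)]
    return sum(shifts) / len(shifts)

def within_cutoff(displacement, cutoff=0.325):
    """Compare against the (arbitrary) 0.325 Å cutoff quoted in the text."""
    return displacement <= cutoff
```

A value returned by mean_symmetrizing_displacement can then be compared with the reported 1σ coordinate uncertainty of the structure, which is the comparison made in Fig. 4(a).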
This argument is made stronger by considering whether the imposition of added symmetry requires random displacements of individual atoms or rigid-body motions of entire polypeptide chains. The quantity Δr_{ASU} (equation 10) gives an indication of the random variations of equivalent atoms once the polypeptide chains are superimposed by a rigid-body motion (Δr_{chain} serves the same function for cases where there are multiple chains in the asymmetric unit). The plotted values in Fig. 4(a) demonstrate that for most cases Δr_{ASU} (or Δr_{chain} where appropriate) is nearly identical to Δr_{sym}; on average, all but 0.01 Å of the displacement required comes from individual atomic motions. The fact that there is virtually no rigid-body component is consistent with the idea that subunit differences are a consequence of experimental uncertainties rather than true observations of NCS variation.
A final and compelling factor supporting the higher symmetries is the distribution of observed structure factors, which are published in the PDB for 694 of the cases. The agreement of symmetry-equivalent observed intensities is quite good under many of the higher symmetry operators, with values of R_{symop}(I^{obs}) clustering about an average of 4% (Fig. 4b). The fact that the reported merging R values from these same 694 structures have an average of 8% suggests that any observed differences in symmetry-equivalent intensities are not experimentally significant.
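The intensity comparison can be illustrated with a short sketch. Equation (1) is not reproduced in this section; the form used below, Σ|I(h) − I(gh)| / Σ[I(h) + I(gh)], is assumed on the basis of the stated correspondence with the R_{twin} formula of Lebedev et al. (2006), and the function names are hypothetical.

```python
def twofold_along_c(hkl):
    """Reciprocal-space action of a twofold axis along c: (h, k, l) -> (-h, -k, l)."""
    h, k, l = hkl
    return (-h, -k, l)

def r_symop(intensities, op):
    """Agreement of intensities related by a proposed symmetry operator.

    intensities maps (h, k, l) tuples to measured intensities. Reflections
    that map onto themselves, or whose mate was not measured, are skipped,
    and each unordered pair is counted once.
    """
    numerator = 0.0
    denominator = 0.0
    seen = set()
    for hkl, i1 in intensities.items():
        mate = op(hkl)
        if mate == hkl or mate not in intensities:
            continue
        pair = frozenset((hkl, mate))
        if pair in seen:
            continue
        seen.add(pair)
        i2 = intensities[mate]
        numerator += abs(i1 - i2)
        denominator += i1 + i2
    return numerator / denominator if denominator else float("nan")
```

A value well below the merging R reported for the same data set, as in the 4% versus 8% comparison above, would indicate that the intensity differences under the proposed operator are within the measurement error.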
Taken together, the data in Fig. 4 are evidence that a considerable number of PDB structures could be reassigned to higher symmetry space groups. Reassigning the space group would reduce the number of polypeptide chains in the model by a factor of n, where n = 2 for most cases but in some cases is found to be 3, 4, 6 or even 12 (Table 5). It is not apparent whether the reassignment candidates have any particular properties in common; for example, they seem to be distributed over the entire range of limiting resolutions represented in the PDB. Furthermore, all point groups for which supergroups are available are present in the list (Tables 6 and S1).


Care should be taken to distinguish between the present results and a previous study by Wang & Janin (1993) showing that NCS axes tend to lie nearly parallel to unit-cell edges or face or body diagonals. The vast majority of structures listed by Wang & Janin are likely to have correctly classified space groups, with verifiable differences between NCS-related subunits. None of the cases listed in that paper appear in our list of candidates for reclassification (Table S1).
The choice of Δr_{sym} = 0.325 Å as a cutoff for producing Fig. 4 and Table S1, while arbitrary, reflects the notion that larger values of Δr_{sym} and Δr_{chain} are more likely to exceed the expected coordinate uncertainty, implying confident pseudosymmetry rather than underassigned symmetry. It is instructive to consider how these latter categorizations relate to the mathematical treatment of §2.5.1: with underassigned symmetry the representatives g_{2}…g_{n} are exact symmetry operators leaving the structure invariant, while with pseudosymmetry these operators match atoms in an approximate rather than an exact fashion. Furthermore, with pseudosymmetry there is the attendant possibility of twinning (Padilla & Yeates, 2003), in which the representatives act as operators that describe the mutual relationship of different unit cells in the crystal. The R_{symop}(g_{i}) values obtained from equation (1) correspond to the R_{twin} formula defined by Lebedev et al. (2006), suggesting a role for the R_{symop}(I^{calc}) and R_{symop}(I^{obs}) statistics in quantifying twinning as discussed in that reference.
Ideally, any validation process to prepare structures for final publication and deposition should scrutinize the choice of space group. Normally the compatible Bravais lattices are evident at the stage of autoindexing, when the observed unit-cell dimensions are checked for higher symmetry metrics. Subsequently, at the step of dataset merging, it is usually possible to unambiguously identify the point group of the diffraction pattern. Yet the data shown here indicate that a fraction of cases are misassigned, suggesting that a third check should be added at a later step, after the atomic model is built.
It is fair to ask how beneficial such a procedure would be. In the unusual but ideal situation in which the data are very accurately measured and there is an ample data-to-parameter ratio, it should be possible to obtain an accurate structure even if the symmetry is underassigned. However, in more typical cases in which the desired atomic details may be only marginally observable in the electron-density map, the constraints offered by perfect symmetry may be crucial to map interpretation. Much of crystallography today is centered on elucidating the relationship between proteins and small-molecular ligands, including ions and drugs, and the models for these interactions may not be as well restrained by stereochemistry as those of proteins. Assignment into a higher symmetry space group may prove helpful in borderline cases where it is barely possible to discern the ligand. The ability to align symmetry-equivalent models arising from space-group reassignment (explained in §2.5.3) is intended to assist the crystallographer in determining whether there are regions of the model that may exhibit especially large changes under the proposed symmetry target and which therefore warrant extra attention.
The spectre of re-evaluating the space groups assigned to hundreds of crystal structures calls to mind recent discussions regarding the worth of archiving original crystallographic diffraction images (see, for example, Baker et al., 2008). If the objective is to justify a certain choice of symmetry to future investigators, then data archival assumes a new importance.
The procedures described here are included in the software package LABELIT, available for download by non-commercial users at https://cci.lbl.gov/labelit and for licensing by commercial users. Command-line parameters for the program labelit.check_pdb_symmetry, explained in the online manual, permit the input of both coordinates and structure factors. LABELIT is also included with the PHENIX package (Adams et al., 2002), available for download at https://www.phenix-online.org.
Supporting information
Supporting information file. DOI: https://doi.org/10.1107/S0907444910001502/dz5193sup1.pdf
Acknowledgements
The authors would like to thank Ashley Deacon (Joint Center for Structural Genomics) for creating the archive of full data sets associated with published JCSG structures, making it possible to develop new methods such as those described here. Karen Woo (Lawrence Berkeley National Laboratory) provided invaluable technical assistance. We thank the NIH for financial support of the LABELIT (1R01 GM77071) and PHENIX (1P01 GM63210) projects and for additional support to PHZ (Y1GM906411). This work was partially supported by DOE contract No. DE-AC02-05CH11231.
References
Afonine, P. V., Grosse-Kunstleve, R. W. & Adams, P. D. (2005). Acta Cryst. D61, 850–855.
Afonine, P. V., Grosse-Kunstleve, R. W., Adams, P. D., Lunin, V. Y. & Urzhumtsev, A. (2007). Acta Cryst. D63, 1194–1197.
Adams, P. D., Grosse-Kunstleve, R. W., Hung, L.-W., Ioerger, T. R., McCoy, A. J., Moriarty, N. W., Read, R. J., Sacchettini, J. C., Sauter, N. K. & Terwilliger, T. C. (2002). Acta Cryst. D58, 1948–1954.
Baker, E. N., Dauter, Z., Guss, M. & Einspahr, H. (2008). Acta Cryst. D64, 337–338.
Berman, H., Henrick, K. & Nakamura, H. (2003). Nature Struct. Biol. 10, 980.
Boisen, M. B. Jr & Gibbs, G. V. (1990). Mathematical Crystallography (Reviews in Mineralogy, Vol. 15), revised ed. Washington DC: Mineralogical Society of America.
Dauter, Z. (1999). Acta Cryst. D55, 1703–1717.
Evans, P. (2006). Acta Cryst. D62, 72–82.
Giacovazzo, C., Monaco, H. L., Viterbo, D., Scordari, F., Gilli, G., Zanotti, G. & Catti, M. (1992). Fundamentals of Crystallography. Chester, Oxford: IUCr/Oxford University Press.
Grosse-Kunstleve, R. W. (1999). Acta Cryst. A55, 383–395.
Grosse-Kunstleve, R. W., Sauter, N. K. & Adams, P. D. (2004). Acta Cryst. A60, 1–6.
Grosse-Kunstleve, R. W., Sauter, N. K., Moriarty, N. W. & Adams, P. D. (2002). J. Appl. Cryst. 35, 126–136.
Grosse-Kunstleve, R. W., Zwart, P. H., Afonine, P. V., Ioerger, T. R. & Adams, P. D. (2006). Newsl. IUCr Comm. Crystallogr. Comput. 7, 92–105.
Hahn, T. (1996). Editor. International Tables for Crystallography, Vol. A, 4th ed. Dordrecht: Kluwer Academic Publishers.
Hendrickson, W. A. (1985). Methods Enzymol. 115, 252–270.
Hirshfeld, F. L. (1968). Acta Cryst. A24, 301–311.
Hooft, R. W. W., Sander, C. & Vriend, G. (1994). J. Appl. Cryst. 27, 1006–1009.
Hooft, R. W. W., Vriend, G., Sander, C. & Abola, E. E. (1996). Nature (London), 381, 272.
Jones, T. A. & Liljas, L. (1984). Acta Cryst. A40, 50–57.
Kearsley, S. K. (1989). Acta Cryst. A45, 208–210.
Kleywegt, G. J. (2000). Acta Cryst. D56, 249–265.
Kleywegt, G. J., Hoier, H. & Jones, T. A. (1996). Acta Cryst. D52, 858–863.
Koch, E. & Fischer, W. (1996). International Tables for Crystallography, Vol. A, 4th ed., edited by T. Hahn, pp. 855–869. Dordrecht: Kluwer Academic Publishers.
Lebedev, A. A., Vagin, A. A. & Murshudov, G. N. (2006). Acta Cryst. D62, 83–95.
Le Page, Y. (1982). J. Appl. Cryst. 15, 255–259.
Le Page, Y. (1988). J. Appl. Cryst. 21, 983–984.
Leslie, A. G. W. (1999). Acta Cryst. D55, 1696–1702.
Luzzati, V. (1952). Acta Cryst. 5, 802–810.
Marsh, R. E. (1995). Acta Cryst. B51, 897–907.
Marsh, R. E. (1997). Acta Cryst. B53, 317–322.
Marsh, R. E. (2009). Acta Cryst. B65, 782–783.
Marsh, R. E. & Herbstein, F. H. (1988). Acta Cryst. B44, 77–88.
Marsh, R. E. & Spek, A. L. (2001). Acta Cryst. B57, 800–805.
McCoy, A. J., Grosse-Kunstleve, R. W., Adams, P. D., Winn, M. D., Storoni, L. C. & Read, R. J. (2007). J. Appl. Cryst. 40, 658–674.
Navaza, J. & Vernoslova, E. (1995). Acta Cryst. A51, 445–449.
Padilla, J. E. & Yeates, T. O. (2003). Acta Cryst. D59, 1124–1130.
Palatinus, L. & van der Lee, A. (2008). J. Appl. Cryst. 41, 975–984.
Sauter, N. K., Grosse-Kunstleve, R. W. & Adams, P. D. (2004). J. Appl. Cryst. 37, 399–409.
Sauter, N. K., Grosse-Kunstleve, R. W. & Adams, P. D. (2006). J. Appl. Cryst. 39, 158–168.
Spek, A. L. (2009). Acta Cryst. D65, 148–155.
Terwilliger, T. C., Grosse-Kunstleve, R. W., Afonine, P. V., Moriarty, N. W., Zwart, P. H., Hung, L.-W., Read, R. J. & Adams, P. D. (2008). Acta Cryst. D64, 61–69.
Urzhumtseva, L., Afonine, P. V., Adams, P. D. & Urzhumtsev, A. (2009). Acta Cryst. D65, 297–300.
Wang, X. & Janin, J. (1993). Acta Cryst. D49, 505–512.
Weiss, M. S. (2001). J. Appl. Cryst. 34, 130–135.
Zwart, P. H., Grosse-Kunstleve, R. W. & Adams, P. D. (2005). CCP4 Newsl. 43, contribution 7.
Zwart, P. H., Grosse-Kunstleve, R. W., Lebedev, A. A., Murshudov, G. N. & Adams, P. D. (2008). Acta Cryst. D64, 99–107.
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.