research papers
Intensity statistics in twinned crystals with examples from the PDB
^{a}Structural Biology Laboratory, Department of Chemistry, University of York, Heslington, York YO10 5YW, England
^{*}Correspondence email: lebedev@ysbl.york.ac.uk
Entries deposited in the Protein Data Bank as of February 2004 for which both model and X-ray data were available were analysed to identify cases of twinning using such simple statistics as the R factor between potential twin-related reflections. Careful consideration of all identified twins showed that in many cases twinning was ignored during structure solution and refinement. Manual analysis of the models showed that twinning often occurs in association with rotational pseudosymmetry parallel to the twinning operator. The coexistence of these two phenomena complicates the detection and diagnostics of twinning using currently available tests. It was concluded that a twinning-detection step should be incorporated in every stage of structure analysis from data acquisition to refinement and validation.
Keywords: twinning; Protein Data Bank; intensity statistics.
1. Introduction
The Protein Data Bank (PDB; Bernstein et al., 1977; Berman et al., 2002) is a rich source of biological, biochemical and structural information. It also offers templates for the determination of new structures by molecular replacement. The huge number of models with experimental X-ray data provides numerous training cases of varying difficulty that are useful to both practical crystallographers and software developers. These cases should be analysed before approaching real-life difficult cases and, in an ideal world, all new software should be tested against them before general release.
However, one should be careful when extracting information from the PDB because of several problems, some of which have been described by Kleywegt (1999, 2000). Currently, a new entry goes through a careful validation procedure during deposition. Nevertheless, at least one potential problem, twinning (Giacovazzo et al., 1992), has not yet been addressed. Twinning tests should be included in the validation routine and, when twinning is present, the structure-factor analyses need to be adjusted accordingly. If twinning is not taken into consideration during refinement, the resulting model will inevitably be degraded. Therefore, during deposition, it is important to notify the depositor if this is the case.
The twinning phenomenon in crystals has been recognized for a long time (Friedel, 1926). For small-molecule structures, data collection and processing, structure solution and refinement against data from twinned crystals are routine (Sheldrick & Schneider, 1997). However, the situation with macromolecules is not yet so straightforward; the software used for data acquisition and structure solution has not addressed this problem fully. For example, it is particularly difficult to solve a twinned structure using experimental phasing (Dauter, 2003; Rudolph et al., 2003).
The phenomenon of twinning should be considered as a special case of crystal intergrowth. Crystal clusters are often observed, but usually it is possible either to optimize crystallization to grow a single crystal or to break off a single-crystal fragment. In some cases this simple approach does not work and one has to deal with diffraction data from an intergrown crystal where the diffraction patterns of two or more fragments overlap. If the fragments are orientated in a random manner relative to each other the two lattices can be identified from the first images. [A very interesting case of treatment of such data has been reported by Dauter et al. (2005) in which the intergrown domains have different space groups.] However, diffraction patterns from such intergrown crystals are often deceptive; if the diffraction spots from the two (or more) crystal domains completely overlap, the diffraction pattern will appear normal on initial inspection. In this case the measured observation at a given reciprocal-lattice point is in fact the sum of the twinned sets of intensities, weighted by the relative volumes (`twinning fractions') of the different components. This is called (pseudo)merohedral twinning and the term `twinning' in macromolecular crystallography usually refers to this. The most common case seen in macromolecular crystallography is hemihedral twinning, in which there are only two crystal components related by a twofold operator. However, the situation can be more complicated, as demonstrated by Barends et al. (2005).
For two or more lattices to overlap completely, the unit-cell parameters and crystal symmetry must possess some special relationships. The unit-cell parameters must allow the possibility of higher symmetry than the crystal actually shows. This is most common in tetragonal, trigonal or cubic crystal classes, where the twinning operator will be one of the symmetry operators of the lattice. However, it is also possible for triclinic, monoclinic and orthorhombic crystals when the unit-cell parameters possess some special properties (Giacovazzo et al., 1992). A technique for identifying data sets where the unit-cell parameters and space group can allow (pseudo)merohedral twinning and for finding the possible twinning operators is described by Flack (1987).
In these cases, detecting whether twinning has occurred requires statistical analysis of the whole data set. When a problem is detected two options are open: (i) to discard the data set and try to obtain a new untwinned crystal or (ii) to try to solve and refine the structure using the twinned data. While the first option seems to offer better data, it may turn out to be time-consuming (or even impossible). Moreover, structural genomics imposes strong constraints on the time spent on an individual protein and option (ii), i.e. using what is at hand, is becoming more common, with new software being developed to meet this demand.
This contribution analyses the PDB to find out how often such a problem occurs and to generate ideas for the future automatic treatment of structures using data from twinned crystals. We also describe the major difficulties we faced in the identification of twinning in special cases using the widely available tests.
2. Materials and methods
The PDB February 2004 release containing about 22 000 structures was screened and only those entries where both coordinates and structure factors had been deposited (11 367 entries) were used in the analysis. The unit-cell parameters and space group of these entries were analysed using the technique described in Appendix A. If (pseudo)merohedral twinning was possible then the data set was selected for further analysis. A deviation of up to 5% from the ideal constraints was allowed. This threshold is consistent with Mallard's rule as cited by Grimmer (2003). If observed intensities were present in a file they were used directly and for other applications they were converted to structure factors using TRUNCATE (French & Wilson, 1978). If only observed structure amplitudes were available, estimates of the corresponding intensities were generated, although some information must be lost.
Thus, in all selected cases there is at least one potential twinning operator. R_{twin}, defined in (1), was calculated with respect to each operator for both observed intensities and those calculated from the atomic model. The matrix for a potential twinning operator, selected from the set of equivalent operators, and the associated R_{twin}^{obs} and R_{twin}^{calc} were calculated using a program written by one of us (AAL).
The distribution of R_{twin}^{obs} against R_{twin}^{calc}, referred to as an RvR plot and discussed below, can give a clear indication of twinning. Detailed analysis was carried out for all likely twinned structures. The analysis involved estimation of the likely number of molecules in the asymmetric unit using SFCHECK, calculation of the self-rotation function (Rossmann & Blow, 1962) as implemented in MOLREP (Vagin & Teplyakov, 1997), twinning tests based on overall reflection statistics, namely the cumulative distribution of normalized intensity and moments of acentric reflections (Rees, 1980) as implemented in TRUNCATE, and H-tests (Yeates, 1997) as implemented in SFCHECK (Vaguine et al., 1999). If the interpretation of these tests was ambiguous, then molecular replacement using MOLREP and refinement using REFMAC (Murshudov et al., 1997) were carried out using the model from the PDB without substrates and with all atomic displacement parameters (ADPs) reset to equal values. The models, Patterson and electron-density maps were visualized using Coot (Emsley & Cowtan, 2004). Further statistical analyses of the results were performed using the statistical package R (R Development Core Team, 2004). Some figures in this paper are based on those generated from CCP4 software (Collaborative Computational Project, Number 4, 1994).
3. RvR plot
Detection of twinning should ideally be performed at the stage of data acquisition before the space group is known. This task is not always trivial; for example, perfect twinning cannot be detected from merging statistics. In some instances, even finer (than R_{merge}) statistical properties of the data are too ambiguous for assignment of crystal symmetry and detection of twinning prior to the structure determination.
Therefore, we undertook an investigation of all possible twinning cases, known or undetected, deposited in the PDB. The goal of the work was to understand the symmetry environments most frequently accompanying twinning and to pinpoint problems with its detection. Since for these data sets both the atomic model and the experimental data are available, the analysis is considerably simplified.
3.1. R factor with respect to twinning operator
Let us assume that in a given crystal the combination of lattice and crystal symmetries allows twinning. This means that there is at least one potential twinning operator S_{twin}. It can be determined using, for example, the technique described in Appendix A.
Let R_{twin} be the intensity-based R factor between reflections related by the potential twinning operator S_{twin},
R_{twin} = Σ_{h} |I(h) − I(h′)| / Σ_{h} [I(h) + I(h′)],   (1)
where summation is over all unique reflections h, such that intensities for both h and h′ = S_{twin}h have been measured and h ≠ h′. The definition of R_{twin} (1) is similar to that of R_{sym}, except that S_{twin} is not a symmetry operator of the crystal but belongs to the point group of the crystal lattice.
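The definition above can be sketched in a few lines of Python. This is only an illustrative sketch and not the authors' program: the function names, the dictionary layout and the convention of applying the operator as a 3×3 integer matrix to a column of Miller indices are assumptions, as is the ratio form of R_{twin} (summed absolute differences over summed totals of operator-related intensities). Crystallographic symmetry equivalents and Friedel mates are not merged here, and the pair-counting trick assumes a twofold operator (the hemihedral case).

```python
def apply_operator(matrix, hkl):
    """Apply a 3x3 integer matrix to Miller indices (h, k, l)."""
    return tuple(sum(matrix[r][c] * hkl[c] for c in range(3)) for r in range(3))


def r_twin(intensities, s_twin):
    """Intensity-based R factor between reflections related by s_twin.

    intensities: dict mapping (h, k, l) tuples to intensities (hypothetical layout).
    s_twin: 3x3 integer matrix of a potential twofold twinning operator.
    """
    numerator = 0.0
    denominator = 0.0
    for hkl, i_h in intensities.items():
        mate = apply_operator(s_twin, hkl)
        # Skip reflections invariant under the operator and pairs whose mate
        # was not measured; the ordering test counts each pair only once.
        if mate == hkl or mate not in intensities or mate < hkl:
            continue
        i_mate = intensities[mate]
        numerator += abs(i_h - i_mate)
        denominator += i_h + i_mate
    if denominator == 0.0:
        raise ValueError("no operator-related pairs found")
    return numerator / denominator


# Toy data: the operator maps (h, k, l) to (k, h, -l).
S_TWIN = [[0, 1, 0], [1, 0, 0], [0, 0, -1]]
DATA = {(1, 2, 3): 10.0, (2, 1, -3): 6.0, (1, 1, 0): 5.0, (3, 0, 1): 4.0}
print(r_twin(DATA, S_TWIN))
```

In the toy data only one pair contributes: (1, 1, 0) is invariant under the operator and the mate of (3, 0, 1) is absent, so both are skipped, as required by the conditions on the summation.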
R_{twin}^{obs} and R_{twin}^{calc} are R_{twin} calculated using observed intensities and (untwinned) intensities derived from the atomic model, respectively. The relationship between the two magnitudes is as follows (see also Appendix B)
The approximation sign in the equations above is a consequence of model and experimental errors. Our experience shows that in the majority of the cases these errors do not affect the qualitative conclusions.
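The way twinning changes R_{twin}^{obs} relative to R_{twin}^{calc} can be made explicit with a short derivation. It is a sketch under stated assumptions rather than a restatement of the paper's numbered relations: hemihedral twinning with fraction α (a symbol introduced here), error-free data, a model that reproduces the untwinned intensities exactly, and R_{twin} taken as the ratio of summed absolute differences to summed totals of operator-related intensities.

```latex
% Observed (twinned) intensities for a pair h, h' = S_twin h
\begin{align*}
I_{\mathrm{obs}}(h)  &= (1-\alpha)\,I(h) + \alpha\,I(h'), \\
I_{\mathrm{obs}}(h') &= \alpha\,I(h) + (1-\alpha)\,I(h').
\end{align*}
% The sum is unchanged by twinning; the difference is scaled by (1 - 2*alpha)
\begin{align*}
I_{\mathrm{obs}}(h) + I_{\mathrm{obs}}(h') &= I(h) + I(h'), \\
I_{\mathrm{obs}}(h) - I_{\mathrm{obs}}(h') &= (1-2\alpha)\,\bigl[I(h) - I(h')\bigr].
\end{align*}
% Hence, for 0 <= alpha <= 1/2,
\[
R_{\mathrm{twin}}^{\mathrm{obs}}
  = \frac{\sum_h \lvert I_{\mathrm{obs}}(h) - I_{\mathrm{obs}}(h')\rvert}
         {\sum_h \bigl[I_{\mathrm{obs}}(h) + I_{\mathrm{obs}}(h')\bigr]}
  = (1-2\alpha)\,R_{\mathrm{twin}}^{\mathrm{calc}} .
\]
```

Under these idealized assumptions the two quantities coincide for α = 0 and R_{twin}^{obs} vanishes for a perfect twin (α = 1/2), whatever the value of R_{twin}^{calc}.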
If the crystal symmetry has been misspecified^{1} then the analysis of unit-cell parameters and space group identifies missing elements of the point group of the crystal as twinning operators. In this case it is expected that both R_{twin}^{obs} and R_{twin}^{calc} are close to zero (see §3.4.2).
Note that a small value of R_{twin}^{obs} can be misinterpreted, as it arises in two different situations, twinning and misspecified symmetry; see (1) and (2). In particular, false positives in the detection of twinning can be found in some PDB entries with misspecified symmetry (see §3.4.2).
3.2. Twinning interfering with NCS
Let a crystal or an individual crystal of a twin possess noncrystallographic symmetry (NCS) and let one of the NCS operators be such that its rotational component is approximately equal to the (potential) twinning operator. In this case, the NCS could interfere with twinning and is further referred to as rotational pseudosymmetry (RPS). There are two reasons why twins with RPS are of special interest.
Firstly, a correlation between observations related by potential twinning operators could be caused either by RPS alone or by both RPS and twinning. The two cases cannot be discriminated by R_{twin}^{obs} alone. These cases are particularly difficult to detect prior to the structure determination.
Secondly, we expect a relatively high frequency of twins with RPS because of the high likelihood of the following two mechanisms of their formation. The first mechanism assumes a change of crystal symmetry (we are interested in symmetry reduction), which is sometimes observed during crystallization, seeding, soaking, fast cooling and even data collection. It is physically reasonable to expect that this transition starts simultaneously in several areas of the crystal. As a result, several identical domains are formed in two or more different orientations related by the broken symmetry and thus the crystal becomes twinned and the broken symmetry operator becomes a twinning operator. At the same time it becomes an RPS operator, relating molecules in the asymmetric unit which were equivalent by crystal symmetry before the transition. In the second mechanism an individual crystal is formed by tightly packed molecular layers with symmetry that is higher than that of the interfaces between them. In these structures the whole layers are (approximately) invariant with respect to NCS operators. Consequently, any NCS operator in the layer at a twinning interface relates two domains and thus the NCS is RPS. The high frequency of twinning interfaces in such symmetry environments leads to statistical crystals (Bragg & Howells, 1954; Cochran & Howells, 1954).
If there is no NCS, no pronounced anisotropy and no serious experimental errors, the expected value of R_{twin}^{calc} can be estimated to be 0.5 as shown in Appendix B. However, when RPS is present, the correlation between reflections related by the potential twinning operator causes a decrease in R_{twin}^{calc}. Thus, in addition to (2) the following holds,
Note that even if RPS is present, (2) holds. Thus, despite the similar effects of RPS and twinning on R_{twin}^{obs}, the availability of a crystal model in principle allows us to distinguish between RPS and twinning interfering with RPS using such simple statistics as R_{twin}^{obs} and R_{twin}^{calc}.
The relations (2)–(4) are illustrated in Fig. 1(a). The figure shows areas corresponding to different combinations of RPS and twinning.
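The qualitative behaviour described in this subsection can be reproduced with a small simulation. The sketch below is an illustration only and rests on assumptions that are not taken from the paper: structure factors of operator-related acentric reflections are drawn as complex Gaussian variables (Wilson statistics), RPS is mimicked by a correlation coefficient `rho` between them, twinning is hemihedral with fraction `alpha`, and R_{twin} is the ratio of summed absolute differences to summed totals. All function and parameter names are hypothetical.

```python
import math
import random


def simulate_pairs(n, rho, seed=0):
    """Return n pairs of untwinned intensities for operator-related reflections.

    rho = 0 gives independent reflections; rho close to 1 mimics strong RPS.
    """
    rng = random.Random(seed)
    sigma = math.sqrt(0.5)  # real and imaginary parts; mean intensity is 1
    mix = math.sqrt(1.0 - rho * rho)
    pairs = []
    for _ in range(n):
        a_re, a_im = rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)
        g_re, g_im = rng.gauss(0.0, sigma), rng.gauss(0.0, sigma)
        b_re = rho * a_re + mix * g_re
        b_im = rho * a_im + mix * g_im
        pairs.append((a_re ** 2 + a_im ** 2, b_re ** 2 + b_im ** 2))
    return pairs


def apply_twinning(pairs, alpha):
    """Mix each pair of intensities with twinning fraction alpha."""
    return [((1 - alpha) * i1 + alpha * i2, alpha * i1 + (1 - alpha) * i2)
            for i1, i2 in pairs]


def r_twin_from_pairs(pairs):
    """R factor between the two members of each pair."""
    numerator = sum(abs(i1 - i2) for i1, i2 in pairs)
    denominator = sum(i1 + i2 for i1, i2 in pairs)
    return numerator / denominator


if __name__ == "__main__":
    for rho in (0.0, 0.9):
        untwinned = simulate_pairs(20000, rho, seed=1)
        r_calc = r_twin_from_pairs(untwinned)
        for alpha in (0.0, 0.3, 0.5):
            r_obs = r_twin_from_pairs(apply_twinning(untwinned, alpha))
            print(f"rho={rho:.1f} alpha={alpha:.1f} "
                  f"R_calc={r_calc:.3f} R_obs={r_obs:.3f}")
```

In this toy model the untwinned, uncorrelated case sits near (0.5, 0.5) on an RvR plot, correlation moves the point down the diagonal, and twinning moves it below the diagonal by lowering only the observed value.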
3.3. Selection of twinning cases
The simplest possible way to select twinning cases from the PDB would be to extract the relevant information from the PDB headers and/or related papers. However, this approach is not sufficient because the researchers depositing data and/or writing papers may either not have noticed or not have discussed twinning (false negatives) or may have misinterpreted higher crystal symmetry as twinning (false positives). Therefore, it was decided to analyse PDB entries directly. This direct approach may also lead to a better understanding of the problems with the detection of twinning.
We analysed unit-cell parameters and the reported crystal symmetry of 11 367 entries present in the PDB at February 2004 containing both an atomic model and X-ray data. Entries where twinning is impossible or where the data were corrupted and unreadable by our software were rejected from further consideration. For the remaining 4086 entries, potential twinning operators were determined and R_{twin}^{obs} and R_{twin}^{calc} were computed. If there were two or more (non-equivalent) potential twinning operators (as, for example, in P3), then that which gave the lowest value of R_{twin}^{obs} was chosen. Thus, each selected entry was characterized by only two quantities, R_{twin}^{obs} and R_{twin}^{calc}, and the corresponding point was drawn on the plot of R_{twin}^{obs} versus R_{twin}^{calc} (RvR plot; see Fig. 1b).
For each structure represented in Fig. 1(b), we analysed whether the twinning, if present, is merohedral or pseudomerohedral using the technique described in Appendix A. The points in Fig. 1(b) are coloured according to the results of this analysis.
All cases belonging to `twinning areas' in the RvR plot (Fig. 1a) were analysed in detail to validate the presence or absence of twinning and to characterize the NCS if present. The specific areas and some peculiarities of the RvR plot are discussed below.
3.4. Observed RvR plot
3.4.1. Main cluster
A large cluster around (0.5, 0.5) corresponds to untwinned crystals with no pronounced pseudosymmetry. However, twinning is not forbidden by the unit-cell parameters and space group and could occur for related crystals. Some of these points correspond to data sets which were detwinned before deposition. These cases were not included in further analysis. In particular, all untwinned crystals in space groups P3_{x}, P3_{x}21, P3_{x}12, P4_{x}, I4_{x}, P6_{x}, P2_{x}3, I2_{x}3 and F23 belong to this area. Since the lattices of these space groups have higher rotational symmetry than that of the crystals, no extra constraints on the unit-cell parameters are needed for twinning to occur (Giacovazzo et al., 1992; Schlessman & Litvin, 1995).
3.4.2. Misspecified crystal symmetry
The cluster at the origin corresponds to structures in which the crystal symmetry is misspecified and is actually higher than that used in the refinement and reported in the PDB entry. Thus, both R_{twin}^{obs} and R_{twin}^{calc} are expected to be close to 0.0. Several randomly chosen cases from this cluster have been successfully refined in the higher symmetry space group. It is interesting to note that for two of them twinning was reported, presumably on the basis of the low R_{twin}^{obs}; these are examples of false positives.
The first reliable case of twinning has R_{twin}^{calc} = 0.2. At the same time, in some cases where the symmetry was misspecified R_{twin}^{calc} goes up to 0.3; these are mainly low-resolution structures where it is easy to overfit the model and to generate significant differences between independently modelled symmetry-related molecules.
3.4.3. RPS
The lower tail of the main cluster corresponds to untwinned crystals with RPS. Most points in this area are located on the diagonal, with R_{twin}^{calc} ≃ R_{twin}^{obs} in the range 0.35–0.4. This tail extends along the diagonal down to about 0.2. Here we find one of the most extreme examples, the untwinned 1i1j (Lougheed et al., 2001), in which the root-mean-square deviation of C^{α} atoms from the positions corresponding to higher crystal symmetry is about 0.15 Å.
3.4.4. Translational noncrystallographic symmetry (TNCS)
The main cluster has an upper diagonal tail around (0.6, 0.6) corresponding to structures with TNCS, in which the set of TNCS vectors and consequently the modulation of intensities in reciprocal space caused by TNCS are not invariant with respect to S_{twin}. Thus, the intensities related by S_{twin} are modulated differently and the assumptions required for the relation (10) in Appendix B to be valid are violated. The numerator in (1) for both R_{twin}^{obs} and R_{twin}^{calc} increases, increasing their values.
Among such structures we observed no cases of twinning (note the empty area below this cluster in the RvR plot).
3.4.5. Mislabelled and corrupt data
There are some extra features on the RvR plot arising from mislabelling of columns in the deposited data file. Two small clusters shown by circles in Fig. 1(b), which are just above and below the main cluster, are worth mentioning. In the first one, at about (0.5, 0.4) on the RvR plot, the structure amplitudes are present in the file but are labelled as intensities. In the second one, at about (0.5, 0.6), the intensities are labelled as structure amplitudes. These peculiarities may in theory be identified by simple statistical techniques. However, if such factors as twinning or anisotropy affect the data or several deposition inaccuracies (for example, deposition of the detwinned instead of the measured data) are present simultaneously, then such analysis becomes complicated, if possible at all.
3.4.6. Areas of the RvR plot indicating twinning is likely
The points below the diagonal should, at least in theory, correspond to twins. In particular, points that deviate from the diagonal with R_{twin}^{calc} significantly less than 0.5 should correspond to twins with RPS (see §3.2) and, with an adequate refinement protocol, the deviation from the diagonal should correlate with the twinning fraction.
The cases with R_{twin}^{calc} in the range 0.2–0.6 and R_{twin}^{obs} in the range 0.0–0.3, as well as some randomly chosen cases from other areas, were further investigated (coloured circles in Fig. 1c). The protocol of analysis included validation of the model, various twinning tests performed with both observed and calculated intensities with different resolution cutoffs and characterization of the NCS. In particular, NCS operators, if present, were compared with potential twinning operators to identify RPS. If there was spatial pseudosymmetry, attempts were made to refine structures in the corresponding higher symmetry space group to ensure that the reported crystal symmetry was correct.
The rectangular area of the RvR plot under consideration overlaps with the areas discussed above and therefore includes a number of other data sets which turned out to be untwinned but which had special `features' such as misspecified symmetry, mislabelled structure amplitudes or which displayed RPS and therefore lay on the diagonal with both R_{twin}^{obs} and R_{twin}^{calc} below 0.5.
78 cases of twinning have been identified, verified and characterized. They are marked in Fig. 1(c) with red and green circles corresponding to the presence and absence of RPS interference, respectively. These cases are further discussed in the next subsection.
3.5. Twinning cases
All the structures and their data belonging to the twinning areas of the RvR plot (Figs. 1a and 1c) were analysed in detail to identify actual twinning cases. Table 1 contains symmetry and NCS information for the 78 cases identified with a high degree of confidence. NCS for DNA structures was not analysed.

There are two features of this table that are worth mentioning. Firstly, twinning is not unusual. Secondly, the cases where twinning interferes with RPS are more frequent than simple twinning, especially for the pseudomerohedral twins.
One of the important practical conclusions from these analyses is as follows. If the perfect twinning tests show the presence of twinning it does not necessarily mean that the twinning is merohedral. Thus, even if data from a perfect twin have point-group symmetry P422, it does not follow, either theoretically or in practice, that the point-group symmetry of the individual crystal is P4 and that the twinning is generated by a twofold axis orthogonal to the crystallographic fourfold axis. For example, the crystal symmetry of 1upp (Karkehabadi et al., 2003) is C222_{1} and there are two twin domains, with one possible choice of twinning operator being a fourfold axis along one of the crystallographic twofold axes.
3.6. False negatives
It is interesting to note that only one third of the cases identified as twins by our analysis were reported as such in the PDB submission, although in some of these cases analysis of intensities derived from atomic models shows that the twinning was actually taken into account during refinement. Nevertheless, in a significant number of cases this was not done.
The effect of ignoring twinning on R_{twin}^{calc} is illustrated by the following simulated experiment. The 3.1 Å data from an untwinned crystal were artificially twinned to produce six data sets with twinning fractions of 0.0, 0.1, …, 0.5. The model from the PDB was refined against all data sets following the same protocol, without model rebuilding and ignoring twinning. R_{twin}^{obs} and R_{twin}^{calc} were computed for these six data sets and for the intensities calculated from the appropriate `refined' models. The result is shown as the central blue curve in Fig. 1(d). If twinning had been properly taken into account during refinement then R_{twin}^{calc} would remain constant throughout all these refinements (vertical green line on the right in Fig. 1d). Note that the `incorrect refinements' have been carried out starting from the correct model. Even in these cases the points clearly drift towards the left on the RvR plot. Since real-life structure solution requires many cycles of refinement alternated with model building, it is anticipated that this drift to the left is much more serious than in this simulation. To analyse this trend further, `refinements' were carried out with relaxed restraints on ADPs. The results are plotted in red in Fig. 1(d) and show further reduction of R_{twin}^{calc}.
This simulated experiment helps to explain why only some of the twinning cases without RPS (green points in Fig. 1c) show R_{twin}^{calc} ≃ 0.5. In all of them proper accounting for twinning has been performed. In some cases without RPS, R_{twin}^{calc} is significantly less than 0.5 and, judging by the simulated results shown in Fig. 1(d), we expect that the refinement protocol was not adequate.
The above simulated experiment is one of the cases where the so-called `model bias' arises because of an insufficient number of parameters and the addition of only one extra parameter, the twinning fraction, would substantially reduce it. Generally speaking, model bias is not so much a consequence of a large number of parameters, but of incorrect parameterization; bias is best corrected by reparameterization of the model rather than by removing a part of it.
4. Performance of twinning tests
During the verification of twinning in the cases selected using the RvR plot a number of problems were encountered. Some of these problems are of a general nature and are worthy of special attention. This section discusses the influences of experimental error and pseudosymmetry on perfect twinning tests and the influence of RPS on one particular partial twinning test, the H-test.
4.1. Effect of experimental errors on the perfect twinning tests
In the perfect twinning tests, the observed intensities normalized within resolution shells are assumed to be sampled from the one-dimensional distributions of the random variable Z. Two different distributions are considered, one for centric and one for acentric reflections. Derivation of these distributions (Rees, 1980) is based on the Wilson distribution of structure factors (Wilson, 1949) for untwinned crystals. Two of the major tests are based on comparison of the theoretical and observed curves of (i) the cumulative distribution of Z versus Z and (ii) the second moment of Z versus resolution, shown in Figs. 2, 3 and 4.
It is necessary to use sensible resolution cutoffs to be able to draw any reliable conclusions from these tests. The reason for this is that high-resolution reflections as a rule have larger experimental errors, but the theory does not take these into account.
Our experience shows that a low-resolution cutoff is not necessary; however, it is important to remove high-resolution data, where R_{standard} = 〈σ(F)〉/〈F〉 starts growing and/or where a large variation of the second moment of Z for acentric reflections is observed. The required plots are available from various programs, e.g. TRUNCATE and SFCHECK. An example of how this rule of thumb works is shown in Fig. 2. In this example, the experimental cumulative distribution of Z clearly indicates perfect twinning with a high-resolution cutoff at 2.2 Å. In contrast, the same test but with all data is misleading and the experimental curves are close to the theoretical curves for untwinned crystals.
It is important to emphasize that high-resolution reflections do contain useful information about the structure despite a resolution cutoff being needed for some applications.
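The second-moment test discussed in this subsection can be illustrated with simulated data. The following is a minimal sketch, assuming ideal Wilson statistics for acentric reflections (exponentially distributed intensities), no experimental errors and hemihedral twinning; the function names are hypothetical. Under these assumptions the second moment of Z takes the familiar theoretical values of 2.0 for untwinned data and 1.5 for a perfect twin.

```python
import random


def second_moment(intensities):
    """Return <I^2> / <I>^2, i.e. the second moment of Z = I / <I>."""
    n = len(intensities)
    mean_i = sum(intensities) / n
    mean_i2 = sum(i * i for i in intensities) / n
    return mean_i2 / (mean_i * mean_i)


def simulate_acentric(n, twin_fraction, seed=0):
    """Simulate n acentric intensities from a hemihedral twin.

    Each observation mixes two independent Wilson-distributed intensities
    with the given twinning fraction (0.0 means untwinned).
    """
    rng = random.Random(seed)
    observed = []
    for _ in range(n):
        i1 = rng.expovariate(1.0)
        i2 = rng.expovariate(1.0)
        observed.append((1.0 - twin_fraction) * i1 + twin_fraction * i2)
    return observed


if __name__ == "__main__":
    for alpha in (0.0, 0.25, 0.5):
        moment = second_moment(simulate_acentric(100000, alpha, seed=3))
        print(f"twin fraction {alpha:.2f}: second moment {moment:.2f}")
```

With real data the same statistic would be evaluated in resolution shells, which is where the growth of experimental errors at high resolution shows up as a large variation of the curve.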
4.2. Effect of RPS on the perfect twinning tests
Our experience shows that RPS affects perfect twinning tests only in the presence of twinning, when it partially compensates for the effect of twinning.
The following numerical experiment illustrates this effect. The 1i1j X-ray data set represents an untwinned crystal with RPS. The data set with perfect twinning was simulated from the original untwinned data. All parameters except the twinning fraction are the same in the two data sets. The experimental second moment of Z versus resolution and the cumulative distribution of Z are shown in Fig. 3. The experimental curves for the original data set match theoretical predictions (Figs. 3a and 3b); however, this is not so for the simulated data set, where only a marginal deviation from the theoretical curves for untwinned data towards those for perfect twins is observed (Figs. 3c and 3d). This example demonstrates that the simple theory based on Wilson's distribution, which assumes a uniform distribution of atoms in the asymmetric unit, fails for twins with RPS.
Such behaviour can intuitively be understood by imagining a traversal of the RvR plot (Fig. 1). If we travel from the main cluster at (0.5, 0.5) towards the origin along the diagonal, we start from a point without anomalies of any kind and finish at the point where the crystal symmetry is higher than the reported symmetry but also with no anomalies. At both ends we have untwinned data statistics and the same statistical distributions could be expected all along the diagonal pathway (where untwinned data sets with RPS are located). Another limiting path is from the point (0.5, 0.0), below the main cluster, towards the origin along the abscissa. On this path the transition occurs from perfect twin statistics to untwinned statistics. The above example with simulated twinning is located on this path at a point where this transition is almost accomplished.
This behaviour of perfect twinning tests has been observed in a number of real cases where RPS interferes with twinning (Table 1; see also an example in Dauter et al., 2005). The closer the NCS operator generating RPS comes to an operator of a higher symmetry space group, the less contrast there is between the results of perfect twinning tests for untwinned and twinned data. This lack of contrast creates difficulties for twinning diagnostics. Fortunately, the atomic structure in such circumstances can frequently be solved and refined to a first approximation in a higher symmetry space group and, when the refinement sticks at an unreasonably high R factor, the structure can be re-solved and further refined in the correct space group. In this scenario problems with uncertain twinning diagnosis are avoided, but it is necessary to collect data in the lower symmetry and to keep them unmerged. Reliable diagnosis of twinning therefore becomes an important component of both data collection and refinement.
The effect of RPS on intensities decreases and therefore the effect of twinning becomes more pronounced in higher resolution shells, where the intensities are affected by the small difference between NCS-related molecules. However, as noted above, the data in higher resolution shells are less reliable for twinning tests because of experimental errors (see §4.1).
4.3. TNCS and twinning
Table 1 shows that when twinning and TNCS coexist the third ingredient, RPS, is usually also present (see, for example, PDB entry 1upp; Karkehabadi et al., 2003). This is not surprising because the reduction of symmetry resulting in twinning must involve reduction of the crystal point group, i.e. formation of RPS.
In these structures higher point-group symmetry and shorter crystallographic translations can be accommodated by small, sometimes less than 1 Å, displacements of atoms. Thus, the modulation of intensities in reciprocal space caused by TNCS can be considered in terms of sublattices with different mean intensities (pseudocentering). Note that in these structures the sublattices are invariant with respect to the twinning operator (compare with §3.4.4).
The modulation of intensities owing to TNCS has an effect on the perfect twinning tests which is opposite to that of twinning. This effect is present in both twinned and untwinned crystals, e.g. the second moment of Z for acentric reflections becomes greater than two in the absence of twinning. The effect becomes stronger when the deviation from higher crystal symmetry decreases. Demodulation (normalization accounting for TNCS) or examination of the separate sublattices may reduce the effect of TNCS, but the effect of RPS remains. For different sublattices, the effect of the RPS is different and depends differently on the deviation from higher crystal symmetry, but it always partially compensates for the effect of twinning.
The analysis of the data in terms of sublattices can be avoided by using the perfect twinning test suggested by Padilla & Yeates (2003) and implemented in DATAMAN (Kleywegt & Jones, 1996). With a proper resolution cutoff this test indicates perfect twinning, but the contrast is less than that theoretically predicted (Padilla & Yeates, 2003). The presence of RPS in most twins with TNCS explains this observation, as the effect of RPS partially compensates for the effect of twinning even if the modulation of intensities owing to TNCS is accounted for.
4.4. Partial twinning tests
There are two partial twinning tests most frequently used in macromolecular crystallography, the Britton test (Britton, 1972) and the H-test (Yeates, 1997). In these tests the data are assumed to be processed in the correct point group or its subgroup and not `overmerged' in a higher point group. These tests are applied to a given potential twinning operator suggested by the crystal and lattice symmetries. Therefore, all non-equivalent twinning operators (e.g. there are two of them in P3) have to be tested individually. Unfortunately, neither of these tests can distinguish between higher symmetry and perfect twinning. Nevertheless, in the case of partial twinning they both indicate twinning and estimate the twinning fraction. We discuss the H-test in more detail with an emphasis on the effect of RPS on its behaviour.
In the H-test the joint two-dimensional distribution of the intensities related by the potential twinning operator is of interest. Thus, extra information is used compared with the previously discussed tests, which are based on one-dimensional distributions of intensities derived from the Wilson distribution. The idea of this test is that the cumulative distribution P(H) of a random variable H is a straight line over the whole range of possible H. In theory, which unfortunately does not take account of the effects of RPS, the linearity holds for both twinned and untwinned data and the slope of the plot P(H) versus H depends on the twinning fraction (blue lines in Fig. 4a).
In the original version of the H-test the linearity is essential, as the twinning fraction is estimated from the mean value of H. However, theoretically impossible differences between intensities related by the twinning operator may appear as a result of radiation damage to the crystal if there was a long time interval between the two measurements. Excessively large differences could also occur if the X-ray beam was focused on different parts of the crystal during these two measurements. The presence of such outliers distorts the experimental distribution of H at larger H and causes non-linearity, as in Fig. 4(a). This type of non-linearity is typical of twins without interfering RPS, although the range of H where the plot deviates from the straight line varies. Such cases can be treated by a modified H-test in which the twinning fraction is estimated using the slope of the plot at the origin (Yeates & Fam, 1999).
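The two estimates described above can be illustrated on simulated data. The following is a minimal sketch, not the implementation used in this work. It assumes ideal acentric Wilson statistics (independent exponential intensities), no RPS, TNCS, anisotropy or measurement error, and the standard acentric results H = |I_1 − I_2|/(I_1 + I_2), P(H) = H/(1 − 2α) and <H> = 1/2 − α. All function names and the cutoff `h_max` are hypothetical.

```python
import numpy as np


def simulate_twinned_pairs(n_pairs, twin_fraction, seed=0):
    """Simulate twin-related intensity pairs for an idealized hemihedral twin."""
    rng = np.random.default_rng(seed)
    # Assumption: untwinned acentric intensities are independent and exponential.
    i1 = rng.exponential(1.0, n_pairs)
    i2 = rng.exponential(1.0, n_pairs)
    # Each observed intensity is a weighted sum of the two twin-related ones.
    j1 = (1.0 - twin_fraction) * i1 + twin_fraction * i2
    j2 = twin_fraction * i1 + (1.0 - twin_fraction) * i2
    return j1, j2


def h_values(j1, j2):
    """H = |J1 - J2| / (J1 + J2) for each twin-related pair."""
    return np.abs(j1 - j2) / (j1 + j2)


def fraction_from_mean(h):
    """Original-style estimate: alpha = 1/2 - <H>."""
    return 0.5 - h.mean()


def fraction_from_initial_slope(h, h_max=0.2):
    """Estimate in the spirit of the modified test, using only small H.

    The empirical P(h_max) / h_max approximates the slope 1 / (1 - 2 alpha);
    h_max is a hypothetical cutoff and must lie below 1 - 2 alpha.
    """
    slope = np.mean(h <= h_max) / h_max
    return 0.5 * (1.0 - 1.0 / slope)


if __name__ == "__main__":
    j1, j2 = simulate_twinned_pairs(200_000, twin_fraction=0.2)
    h = h_values(j1, j2)
    print("alpha from mean of H:     ", fraction_from_mean(h))
    print("alpha from slope at origin:", fraction_from_initial_slope(h))
```

The slope-based estimate uses only the small-H part of the distribution, so it is unaffected by outliers at large H, at the cost of using fewer reflections.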
If RPS interferes with twinning then all pairs of reflections involved in the H-test are affected and the cumulative distribution of H becomes non-linear over the whole range of the argument (Fig. 4b) and both versions of the H-test fail to give a reasonable estimate of the twinning fraction. In cases similar to that in Fig. 4(b), however, the twinning fraction can be estimated from the value of H at the point where the experimental curve approaches the line P(H) = 1. In this formulation the H-test is equivalent to the Britton test. A disadvantage of such a formulation is that the estimate of the twinning fraction is based on the right tail of the distribution, which can be seriously corrupted by the effects of experimental errors mentioned above (see Fig. 4c). Thus, further improvement of the test can only be achieved by accurate modelling of the effect of RPS, TNCS and anisotropy and by accounting for outliers, while keeping the advantage of the original version of the H-test in which the whole data set is utilized.
5. Conclusions
This analysis of the PDB shows that combinations of crystal and lattice symmetries can allow twinning in more than 30% of cases, with both merohedral and pseudomerohedral cases widespread. For easy identification of twinning the RvR plot was designed, which utilizes both observed intensities and intensities derived from the atomic models. Careful analysis of suspected twins identified from this plot has flagged 78 cases with a high degree of confidence. However, since refinement of the atomic model ignoring twinning causes model bias and thus distorts this picture, we expect there may be more actual twins. Moreover, since twinning is one of the factors that often prevents structure solution, there are almost certainly many cases of twinning that have not been fully analysed and deposited in the PDB.
Analysis of all the identified cases showed that RPS coexists with twinning more frequently than we expected, affecting the intensity distributions and thus increasing the difficulty of detecting twinning and hence the analysis of the structure. We found that all tests can fail to give convincing results. The situation becomes even more serious when TNCS is added to the picture. Ideally, one should also consider other crystal-growth anomalies, such as statistical crystals, non-merohedral twinning and split crystals.
As a result of in-depth analysis of the identified twinning cases, we arrived at the conclusion that it is important to check for this and other crystal-growth anomalies at every stage of structure analysis: starting from data acquisition and ending with refinement and validation. To do this correctly, it is important to build a model accounting for various 'abnormalities' and utilizing all the information available up to the current stage. For example, at the data-collection stage an awareness of twinning may help to choose the correct strategy; during refinement proper modelling of this phenomenon can reduce the noise in the electron density and hence help to reveal finer details of the molecular structure.
APPENDIX A
A1. Algorithms used in the determination of twinning operators and their type of merohedry
Several authors (Flack, 1987; Le Page, 2002; Grimmer, 2003) have already described the automatic identification of potential twinning operators using unit-cell parameters and space group. A necessary step in all these algorithms is reducing the cell to a minimum primitive cell, either a Buerger or Niggli cell (see, for example, Mighell & Rodgers, 1980, and references therein). Here, we describe an algorithm designed by one of us (AAL) and implemented in a set of routines.
Given unit-cell parameters and crystal symmetry, potential twinning operators are determined as follows. The cell is reduced to the primitive cell with the shortest unit-cell edges and the point-group operators derived from crystal symmetry are transformed accordingly to give G, a group of 3 × 3 matrices with elements from {−1, 0, 1}. The set of 504 matrices with elements from {−1, 0, 1} of finite order with respect to matrix multiplication and with determinant equal to one is then generated. This set includes all operators of all rotational point groups expressed in fractional coordinates, provided that the primitive cell with shortest edges is chosen. These operators are sorted according to the perturbation that they cause to the metric tensor derived from the primitive unit-cell parameters. The best 24 or fewer generators satisfying the perturbation threshold of 5% are then used sequentially according to the above sorting order to expand G to H, the rotational point group of the lattice,
where a is the tested generator. Expansion (5) is carried out until a(Ga)^{n} contains no new elements or a new element is inconsistent with H being a finite group. Any new consistent element q generates a coset Gq of new elements that are added to the existing subset of H. This procedure simultaneously produces H and its coset decomposition with respect to G. Representatives of the cosets, one from each coset excluding G, are potential (non-equivalent) twinning operators.
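The first two steps of this procedure, generating the candidate operators and ranking them by the perturbation they cause to the metric tensor, can be sketched as follows. This is an illustrative sketch only and not the routines referred to above. It assumes that the input cell is already primitive and reduced. The perturbation measure used here (the relative Frobenius-norm change of the metric tensor) is an assumption, since the measure is not specified in this section. All function names are hypothetical. The coset expansion of G to H is not shown.

```python
import itertools

import numpy as np


def candidate_operators():
    """All 3x3 matrices over {-1, 0, 1} with determinant +1 and finite order."""
    identity = np.eye(3, dtype=np.int64)
    found = []
    for a, b, c, d, e, f, g, h, i in itertools.product((-1, 0, 1), repeat=9):
        # Exact integer determinant by cofactor expansion.
        det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
        if det != 1:
            continue
        m = np.array([[a, b, c], [d, e, f], [g, h, i]], dtype=np.int64)
        # Finite-order integer matrices of this kind have order 1, 2, 3, 4 or 6,
        # so checking m**12 == identity is sufficient.
        if np.array_equal(np.linalg.matrix_power(m, 12), identity):
            found.append(m)
    return found


def metric_tensor(a, b, c, alpha, beta, gamma):
    """Metric tensor from unit-cell lengths and angles (angles in degrees)."""
    ca, cb, cg = (np.cos(np.radians(x)) for x in (alpha, beta, gamma))
    return np.array([[a * a, a * b * cg, a * c * cb],
                     [a * b * cg, b * b, b * c * ca],
                     [a * c * cb, b * c * ca, c * c]])


def perturbation(op, metric):
    """Assumed measure: relative Frobenius-norm change of the metric tensor."""
    changed = op.T @ metric @ op
    return np.linalg.norm(changed - metric) / np.linalg.norm(metric)


def lattice_generators(cell, operators, threshold=0.05, limit=24):
    """Best `limit` or fewer operators within `threshold`, sorted by score."""
    metric = metric_tensor(*cell)
    scored = sorted(((perturbation(op, metric), op) for op in operators),
                    key=lambda pair: pair[0])
    return [(score, op) for score, op in scored if score <= threshold][:limit]
```

For a cubic cell such as (10, 10, 10, 90, 90, 90), the operators with zero perturbation are the 24 proper rotations of the cube, and every other candidate lies far above the 5% threshold under this measure.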
To draw Fig. 1(b), we also analysed the type of merohedry of potential twinning operators using the following method. Let G be a rotational point group and M be a metric tensor, represented as a set of 6 × 6 matrices and as a 6-vector, respectively. Let M be invariant with respect to G, i.e.
Consequently, the projector
is such that πM = M. Let R be a 6 × 6 matrix representing a potential operator. If
then
and no constraints are needed for M to be invariant with respect to R in addition to those imposed by (6). Therefore, if (7) holds, then the twinning generated by R is merohedral. This type of check requires no tables and can be performed in integers if the 6 × 6 matrix representation of G is generated from its 3 × 3 matrix representation in fractional coordinates.
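A check of this kind can be sketched in integer arithmetic as follows. The displayed relations are not reproduced in this section, so the sketch rests on an assumed reading: the projector is the group average π = |G|^{-1} Σ g over the 6 × 6 matrices, and condition (7) is Rπ = π, which gives RM = RπM = πM = M. To stay in integers, the code compares R Σ g with Σ g instead of dividing by |G|. The function names and the ordering of the 6-vector components are hypothetical.

```python
import numpy as np

# Assumed ordering of the metric 6-vector: (M00, M11, M22, M12, M02, M01).
PAIRS = [(0, 0), (1, 1), (2, 2), (1, 2), (0, 2), (0, 1)]


def rep6(op):
    """6x6 integer matrix of the map M -> op^T M op acting on the 6-vector."""
    op = np.asarray(op, dtype=np.int64)
    out = np.zeros((6, 6), dtype=np.int64)
    for col, (i, j) in enumerate(PAIRS):
        basis = np.zeros((3, 3), dtype=np.int64)
        basis[i, j] = basis[j, i] = 1
        image = op.T @ basis @ op
        out[:, col] = [image[p, q] for p, q in PAIRS]
    return out


def close_group(generators):
    """Finite group generated by 3x3 integer matrices (brute-force closure).

    The generators must generate a finite group, otherwise this never ends.
    """
    gens = [np.asarray(g, dtype=np.int64) for g in generators]
    identity = np.eye(3, dtype=np.int64)
    group = {tuple(identity.ravel().tolist()): identity}
    frontier = [identity]
    while frontier:
        new = []
        for elem in frontier:
            for g in gens:
                prod = elem @ g
                key = tuple(prod.ravel().tolist())
                if key not in group:
                    group[key] = prod
                    new.append(prod)
        frontier = new
    return list(group.values())


def is_merohedral(group, twin_op):
    """True if every metric invariant under `group` is invariant under twin_op."""
    total = sum(rep6(g) for g in group)  # equals |G| times the projector
    return np.array_equal(rep6(twin_op) @ total, total)
```

For example, with G generated by a fourfold axis along z, a twofold axis along x passes the check (merohedral case). With G generated only by a twofold axis along z, the same operator fails, because invariance of M then additionally requires γ = 90° (pseudomerohedral case).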
APPENDIX B
Consider, for example, a threefold twinning operator S_{twin} relating three domains. Let h′ = S_{twin}h and h′′ = S_{twin}h′, let h, h′ and h′′ be different (i.e. be in a general position with respect to S_{twin}) and let the three corresponding intensities be measured. For twinning fractions α_{1}, α_{2}, α_{3} and neglected errors,
Relations (2) can be verified as follows.
In the absence of twinning, α_{1} = 1 and α_{2} = α_{3} = 0, and therefore each measured intensity equals the corresponding untwinned intensity and hence R_{twin}^{obs} = R_{twin}^{calc}.
For perfect twinning, α_{1} = α_{2} = α_{3}, the three measured intensities are equal and R_{twin}^{obs} = 0.
For partial twinning,
If intensities are not equal to zero, then the two sides of the above relation are equal only if the three calculated intensities are equal. Hence, assuming that the crystal symmetry is correctly specified and thus there are at least some non-zero non-equal triplets of calculated intensities, we have R_{twin}^{obs} < R_{twin}^{calc}.
Relations (2) can be similarly derived for any kind of twinning, including the usual case of two fractions.
Let us estimate the expected value of R_{twin}^{calc} (in the absence of any twinning) defined in (1), assuming that (i) there is no RPS and (ii) the overall ADP tensor and the set of TNCS vectors (if TNCS is present) are invariant with respect to S_{twin}. Formally, these assumptions mean (i) mutual independence of all random variables I_{h} and (ii) identical exponential distribution of the random variables I_{h} and I_{h′}, h′ = S_{twin}h. In particular, the random variables I_{h} and I_{h′} possess the following joint probability distribution density
where the multipliers (β_{h}) at I_{h} and I_{h′} are the same.
For new variables
the joint probability distribution density is
and, in particular, the conditional probability distribution density of s_{h} given r_{h} is
Thus, the expected value of s_{h} given r_{h} is
Finally,
The last equation means that R_{twin} averaged over all possible structures obeying the above conditions (i) and (ii) equals exactly one half. For a particular structure, this gives the approximate equality in (4).
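The value of one half can be checked numerically. The following Monte Carlo sketch is an illustration under stated assumptions and is not part of the original analysis. Definition (1) is not reproduced in this section, so the sketch assumes R_twin = Σ|I_h − I_h′| / Σ(I_h + I_h′). Following conditions (i) and (ii), I_h and I_h′ are drawn independently from exponential distributions sharing the same parameter within each pair, while that parameter varies from pair to pair. The function names and the range of scales are hypothetical.

```python
import numpy as np


def r_twin(i_h, i_hp):
    """Assumed form of (1): sum |I_h - I_h'| / sum (I_h + I_h')."""
    return np.abs(i_h - i_hp).sum() / (i_h + i_hp).sum()


def simulate_pairs(n_pairs, seed=1):
    """Independent exponential pairs with a common, pair-specific mean."""
    rng = np.random.default_rng(seed)
    # The mean intensity (1 / beta_h) differs between pairs, mimicking
    # resolution dependence, but is shared by the two members of a pair.
    mean_intensity = rng.uniform(0.1, 10.0, n_pairs)
    i_h = rng.exponential(mean_intensity)
    i_hp = rng.exponential(mean_intensity)
    return i_h, i_hp


def apply_twinning(i_h, i_hp, twin_fraction):
    """Observed intensities for two domains with fractions 1 - alpha and alpha."""
    j_h = (1.0 - twin_fraction) * i_h + twin_fraction * i_hp
    j_hp = twin_fraction * i_h + (1.0 - twin_fraction) * i_hp
    return j_h, j_hp


if __name__ == "__main__":
    i_h, i_hp = simulate_pairs(200_000)
    print("R_twin (calc, untwinned):", r_twin(i_h, i_hp))
    j_h, j_hp = apply_twinning(i_h, i_hp, twin_fraction=0.3)
    print("R_twin (obs, alpha = 0.3):", r_twin(j_h, j_hp))
```

In this idealized two-domain model the twinned value is (1 − 2α) times the untwinned one, which illustrates the inequality R_{twin}^{obs} < R_{twin}^{calc} for partial twinning.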
These calculations can also be applied to two unrelated structures, as was performed by Srinivasan & Parthasarathy (1976) for a similar problem but for the conventional R factor.
It is important to stress that this interpretation does not mean that in the X-ray experimental data the resolution shells with R_{merge} higher than 50% are useless. In the case of experimental data, experimental errors are necessarily present and their distribution is different from that used to derive the above relation. The estimation of the resolution cutoff is a completely different problem and has to be approached using different notions, such as the informational content of the data or the informational content of the data per unit of synchrotron time.
Footnotes
^{1}We say that crystal symmetry is misspecified when the space group reported in the PDB file is a subgroup of the true space group of the crystal, e.g. P4 instead of P422. Accordingly, in such cases the PDB file contains more molecules than should be in the asymmetric unit of the crystal, but some of these molecules are actually related by the missing symmetry operator(s).
Acknowledgements
We thank Eleanor Dodson, George Sheldrick, Ian Tickle, Olga Moroz and Vladimir Levdikov for helpful discussions and practical examples. This work was supported by BBSRC (AAL and AAV, grant reference B10670) and the Wellcome Trust (GNM).
References
Barends, T., de Jong, R., van Straaten, K., Thunnisen, A.-M. & Dijkstra, B. (2005). Acta Cryst. D61, 613–621.
Berman, H. M., Battistuz, T., Bhat, T. N., Bluhm, W. F., Bourne, P. E., Burkhardt, K., Feng, Z., Gilliland, G. L., Iype, L., Jain, S., Fagan, P., Marvin, J., Padilla, D., Ravichandran, V., Schneider, B., Thanki, N., Weissig, H., Westbrook, J. D. & Zardecki, C. (2002). Acta Cryst. D58, 899–907.
Bernstein, F. C., Koetzle, T. F., Williams, G. J., Meyer, E. F. Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535–542.
Bragg, W. L. & Howells, E. R. (1954). Acta Cryst. 7, 409–411.
Britton, D. (1972). Acta Cryst. A28, 296–297.
Cochran, W. & Howells, E. R. (1954). Acta Cryst. 7, 412–415.
Collaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760–763.
Dauter, Z. (2003). Acta Cryst. D59, 2004–2016.
Dauter, Z., Botos, I., LaRonde-LeBlanc, N. & Wlodawer, A. (2005). Acta Cryst. D61, 967–975.
Emsley, P. & Cowtan, K. (2004). Acta Cryst. D60, 2126–2132.
Flack, H. (1987). Acta Cryst. A43, 564–568.
French, S. & Wilson, K. (1978). Acta Cryst. A34, 517–525.
Friedel, G. (1926). Leçons de Cristallographie. Paris: Blanchard.
Giacovazzo, H. L., Monaco, H. L., Viterbo, D., Scordari, F., Gilli, G., Zanotti, G. & Catti, M. (1992). Fundamentals of Crystallography. Oxford University Press.
Grimmer, H. (2003). Acta Cryst. A59, 287–296.
Karkehabadi, S., Taylor, T. C. & Andersson, I. (2003). J. Mol. Biol. 334, 65–73.
Kleywegt, G. J. (1999). Acta Cryst. D55, 1878–1884.
Kleywegt, G. J. (2000). Acta Cryst. D56, 249–265.
Kleywegt, G. J. & Jones, T. A. (1996). Acta Cryst. D52, 826–828.
Le Page, Y. (2002). J. Appl. Cryst. 35, 175–181.
Li, T., Ji, X., Fun, F., Gao, R., Cao, S., Peng, Y. & Rao, Z. (2002). Acta Cryst. D58, 870–871.
Lougheed, J. C., Holton, J. M., Alber, T., Bazan, J. F. & Handel, T. M. (2001). Proc. Natl Acad. Sci. USA, 98, 5515–5520.
Mancheno, J., Martin-Benito, J., Martinez-Ripoll, M., Gavilanes, J. & Hermoso, J. (2003). Structure, 11, 1319–1328.
Mighell, A. D. & Rodgers, J. R. (1980). Acta Cryst. A36, 321–326.
Morgan, N., Pereira, I., Andersson, I., Adlington, R., Baldwin, J., Cole, S., Crouch, N. & Sutherland, J. (1994). Bioorg. Med. Chem. Lett. 4, 1595–1600.
Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Acta Cryst. D53, 240–255.
Padilla, J. & Yeates, T. (2003). Acta Cryst. D59, 1124–1130.
R Development Core Team (2004). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org.
Rees, D. (1980). Acta Cryst. A36, 578–581.
Rossmann, M. G. & Blow, D. M. (1962). Acta Cryst. 15, 24–31.
Rudolph, M. G., Kelker, M. S., Schneider, T. R., Yeates, T. O., Oseroff, V., Heidary, D. K., Jennings, P. A. & Wilson, I. A. (2003). Acta Cryst. D59, 290–298.
Schlessman, J. & Litvin, D. B. (1995). Acta Cryst. A51, 947–949.
Sheldrick, G. & Schneider, T. R. (1997). Methods Enzymol. 277, 319–343.
Srinivasan, R. & Parthasarathy, S. (1976). Some Statistical Applications in X-ray Crystallography. Oxford: Pergamon.
Vagin, A. & Teplyakov, A. (1997). J. Appl. Cryst. 30, 1022–1025.
Vaguine, A. A., Richelle, J. & Wodak, S. J. (1999). Acta Cryst. D55, 191–205.
Wilson, A. J. C. (1949). Acta Cryst. 2, 318–321.
Yeates, T. (1997). Methods Enzymol. 276, 345–358.
Yeates, T. O. & Fam, B. C. (1999). Structure Fold. Des. 7, R25–R29.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.