Detection of translational noncrystallographic symmetry in Patterson functions
^{a}Crystallographic Methods, Institute of Molecular Biology of Barcelona (IBMB–CSIC), Baldiri Reixac 15, 08028 Barcelona, Spain, ^{b}Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Hills Road, Cambridge CB2 0XY, United Kingdom, ^{c}Lawrence Berkeley National Laboratory, One Cyclotron Road, BLDG 64R0121, Berkeley, CA 94720, USA, and ^{d}ICREA, Passeig de Lluís Companys 23, 08010 Barcelona, Spain
^{*}Correspondence email: ajm201@cam.ac.uk
Detection of translational noncrystallographic symmetry (TNCS) can be critical for success in crystallographic phasing, particularly when molecular-replacement models are poor or anomalous phasing information is weak. If the correct TNCS is detected then expected intensity factors for each reflection can be refined, so that the maximum-likelihood functions underlying molecular replacement and single-wavelength anomalous diffraction use appropriate structure-factor normalization and variance terms. Here, an analysis of a curated database of protein structures from the Protein Data Bank to investigate how TNCS manifests in the Patterson function is described. These studies informed an algorithm for the detection of TNCS, which includes a method for detecting the number of vectors involved in any commensurate modulation (the TNCS order). The algorithm generates a ranked list of possible TNCS associations in the asymmetric unit for exploration during structure solution.
Keywords: translational noncrystallographic symmetry; maximum likelihood; intensity statistics; molecular replacement.
1. Introduction
Translational noncrystallographic symmetry (TNCS) arises when the asymmetric unit contains components that are oriented in (nearly) the same way and can be superimposed by a translation that does not correspond to any symmetry operation in the space group. There is overall modulation of the intensities: systematically strong and systematically weak intensities (Chook et al., 1998). Structure solution is problematic if the systematic modulation is not taken into account, because the intensity modulation caused by TNCS breaks the implicit assumptions used in likelihood-based methods that the intensities, and the errors in predicting the intensities from the model, follow an isotropic Wilson distribution (Wilson, 1949).
The modulations of the intensities arise because the contributions to a structure factor of molecules related by TNCS have the same (or similar) amplitudes but have relative phases determined by the projection of the translation vector on the diffraction vector. As a result, they interfere constructively for some reflections and destructively for others, so that there is a systematic modulation of the sum of their contributions. The planes affected by intensity modulation are perpendicular to the translation vectors between copies related by TNCS (TNCS vectors). The degree of modulation is less significant if there are rotational and/or conformational differences between the copies, and decreases with increasing resolution. For this reason, in addition to the TNCS vector it is also necessary to estimate any small rotational differences in their orientations (TNCS rotations) and the size of random coordinate differences (TNCS r.m.s.d.) caused by conformational differences (Read et al., 2013) in order to correctly account for TNCS modulation (Fig. 1).
The parameters characterizing TNCS (TNCS vector, TNCS rotation and TNCS r.m.s.d.) are used to generate expected intensity factors for each reflection. Note that the total expected intensity factor for a reflection includes the usual integer factor for the number of times the Miller index of a reflection is identical under all of the distinct pure rotational symmetry operations of the space group (Stewart & Karle, 1976). The TNCS component of the expected intensity factor that models the modulations observed in the data is non-integer (Read et al., 2013), being below 1 for the systematically weak reflections and above 1 for the systematically strong reflections.
After initial estimation, the parameters of the TNCS model are refined, via the expected intensity factors for each reflection derived from the TNCS model, using a likelihood function given by the Wilson distribution of the data (McCoy, 2007).
TNCS does not necessarily associate two components in the asymmetric unit but may relate three or more (n) components associated by a series of vectors that are multiples of 1, 2, 3 … (n − 1) times a basic translation vector. We call n the order of the TNCS and indicate it as TNCS_{n}. Where n times the basic translation vector equates to (or is very close to) a sum of integer multiples of the unit-cell basis vectors, the TNCS describes a pseudocell, and this case is known as commensurate modulation.
The presence of TNCS is shown by the presence of a strong off-origin peak in the Patterson function (Patterson, 1935) caused by the overlap of multiple parallel and equal-length interatomic vectors. In phenix.xtriage (Zwart et al., 2005), TNCS is flagged as present if a Patterson function calculated with data from 5 to 10 Å resolution has a peak more than 15 Å from the origin which is at least 20% of the origin-peak height. The rationale for the resolution limits is to enhance the signal for the low-resolution molecular transform, and the rationale for the distance threshold is to exclude the origin peak and any internal pseudo-translational symmetry such as in helices. However, there has not been a systematic study of the parameters of this approach, nor of how accurate it is in the detection of TNCS. In addition, this approach does not automatically give the order of the TNCS, which is critical for correcting the modulations. In the context of developing automated structure-solution strategies, we are also interested in ranking alternative hypotheses for TNCS.
2. Materials and methods
2.1. Database
The database for the study was derived from an initial subset of 90 083 crystal structures from the PDB (Burley et al., 2019) deposited between 1976 and 2018 and for which there were also deposited X-ray intensities or amplitudes. Structures containing nucleic acids or highly α-helical proteins (75% or more helical content), such as coiled coils, were excluded, since these structural classes are known to have characteristically high intensity modulation even in the absence of TNCS. The helical content was calculated following the distribution of characteristic vectors (CVs; Medina et al., 2020) defined by the centroids of C^{α} and carbonyl O atoms from consecutive and overlapping heptapeptides. The intensity modulations generated by the helical repeats in these structures cannot be corrected by modelling them as TNCS-generated modulations and thus are beyond the scope of this study. Also excluded from the database were collagens, viruses, small non-proteins (antibiotics and peptides), structures with a mean occupancy of less than 0.75 and structures where only the C^{α}-atom coordinates were deposited.
Curation included the following checks on data quality: (i) retracted entries were deleted, (ii) obsolete structures were replaced by the valid entries as of October 2018, (iii) where PDB entries had MTRIX cards to represent NCS operators, the phenix.pdb.mtrix_reconstruction script (Liebschner et al., 2019) was used to reconstruct the crystallographic asymmetric unit and the transformation given in the SCALE cards was used to place the model in the unit cell and (iv) data in the form of unmerged intensities were converted to merged intensities with phenix.reflection_file_converter using the non-anomalous option (Liebschner et al., 2019). Finally, a small subset of structures for which our scripts failed were substituted with data or coordinates from the PDB-REDO database (Joosten et al., 2012) if that solved the issue, or else were deleted without further examination of the causes.
Since the TNCS modulations of intensities become less pronounced at high resolution, where the data extended to high resolution they were truncated to 3 Å resolution in order to save run time in the calculations. Our initial studies were performed without regard to the completeness of the data, but we observed that incomplete data caused outliers in our preliminary analysis, and so our primary database was further curated to remove cases where the data were less than 80% complete, and a separate database was maintained to further study the effects of incompleteness.
The final curated database contains 80 482 structures. Its characteristics and genesis are summarized in Table 1. The small database of structures with data completeness less than 80% consisted of 1294 cases. Both databases are available upon request from the authors.

2.2. Computing and software
The atomic coordinates of structures deposited in the PDB were analysed and TNCS, if any, was identified using the mmtbx.ncs package from the mmtbx module of the Computational Crystallography Toolbox (cctbx; Grosse-Kunstleve et al., 2002). In this algorithm, chains with high sequence identity are identified. They are then structurally superimposed, testing each crystal symmetry operation including the identity, and if they superimpose with a translation then the pair is added to a growing list of TNCS-related chains in the asymmetric unit. The translation can include a rotational tolerance defined by an angular threshold. After all combinations of sequence-matched chains and symmetry operations have been considered, the list is analysed to find the largest TNCS order. Importantly, the analysis forces the TNCS-related molecules to form a group, so, for example, if the rotational tolerance is 3°, and A superimposes on B with a 2° rotation, B superimposes on C with a 2° rotation and A superimposes on C with a 4° rotation, then A, B and C form a TNCS group of order 3 even though A and C do not superimpose within the tolerance of 3°. In the limit of high angular tolerances, high-order rotational symmetry will be misidentified as high-order translational symmetry (see, for example, Albertini et al., 2006; PDB entry 2gtt). The package reports the chain identifier of the TNCS-related chains, the TNCS vector in fractional and orthogonal coordinates, the rotational difference and the percentage of total scattering for the pairs of molecules related by TNCS.
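The chaining behaviour in the A, B, C example can be illustrated with a small union-find sketch. This is not the mmtbx.ncs implementation: the function name `tncs_groups` and the input format (a dictionary of pairwise rotational differences for chain pairs that otherwise superimpose by a translation) are assumptions made purely for illustration.

```python
def tncs_groups(chains, pair_rotations, tolerance_deg):
    """Merge chains into groups by chaining pairwise matches.

    pair_rotations maps (chain_a, chain_b) to the rotational difference in
    degrees for pairs related by a translation (hypothetical input format).
    """
    parent = {c: c for c in chains}

    def find(c):
        # Walk up to the group representative, shortening the path as we go.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for (a, b), angle in pair_rotations.items():
        if angle <= tolerance_deg:
            # A pair within tolerance links the two groups together.
            parent[find(a)] = find(b)

    groups = {}
    for c in chains:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(g) for g in groups.values())


# The example from the text: A-B 2 deg, B-C 2 deg, A-C 4 deg, tolerance 3 deg.
rotations = {("A", "B"): 2.0, ("B", "C"): 2.0, ("A", "C"): 4.0}
groups = tncs_groups(["A", "B", "C"], rotations, tolerance_deg=3.0)
largest_order = max(len(g) for g in groups)
```

Because membership is transitive, A and C end up in the same group of order 3 even though their direct rotational difference exceeds the tolerance.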
The Patterson function was calculated from the deposited data. Where mean intensities were available, reflections recorded as net positive were used for the calculation. If only anomalous intensities were available, a mean intensity was calculated as a simple average of the Friedel mates or using the singleton intensity if only one Friedel mate was present. If only structure-factor amplitudes were available and these had been generated by the French and Wilson procedure (French & Wilson, 1978) then the transformation was reversed to obtain intensities (Read & McCoy, 2016). If only structure-factor amplitudes were available and these had not been generated using the French and Wilson algorithm, the intensity was taken as the square of the structure-factor amplitude; the information loss meant that reflections with negative experimental intensity were set to zero intensity. All data were used without applying an I/σ(I) selection criterion.
The TNCS correction terms were calculated with the Phasertng software package (McCoy et al., 2021) using algorithms like those implemented in Phaser (McCoy, 2007; Read et al., 2013; Sliwiak et al., 2014; Read & McCoy, 2016; Jamshidiha et al., 2019). When the TNCS order is greater than 2 the relative orientations between the components related by the TNCS are not included in the model for TNCS, but their effect is absorbed approximately by the TNCS r.m.s.d. parameter. Correction terms are applied to the observed and calculated structure factors during all likelihood calculations involved in molecular replacement and single-wavelength anomalous diffraction (SAD) phasing.
Figures were prepared with the PyMOL Molecular Graphics System (version 1.8; Schrödinger) and Matplotlib version 1.5.3 (Hunter, 2007).
The decision tree was generated using the scikit-learn Python library version 0.18.1 (Pedregosa et al., 2011).
Calculations were performed on a multiprocessing workstation with two quad-core Intel Xeon processors X5560 at 2.80 GHz with 24 GB RAM and on an 18-core workstation with Intel Core i9-9980XE at 3.00 GHz with 64 GB RAM, both with the operating system Debian GNU/Linux 9.
3. Results
3.1. TNCS in real space
The first question to arise when studying TNCS is `What constitutes TNCS?' This is not a simple question to answer. The effects of TNCS form a continuum between exact TNCS and molecules in the asymmetric unit oriented with large rotation angles with respect to one another (general NCS).
Our initial approach was to use the coordinates for decision making. Whether or not coordinates have TNCS depends on the choice of a rotational tolerance. In our experience of TNCS parameter refinement, TNCS rotations can refine to values up to 10° (Read et al., 2013). Coordinate analysis was therefore carried out exploring a wide range of rotational tolerances from 0° to 20°. The results are shown in Table 2. At small angular tolerances of less than 5°, one in 20 of the structures in the database were flagged as having TNCS, while at 10° tolerance this had increased to nearly one in ten and by 20° it was one in seven. Furthermore, in some cases the order of the TNCS also increased with tolerance; 6% of the TNCS was higher order TNCS (n > 2) at 2° tolerance and 14% at 20° tolerance. Most of the increase in the order of the TNCS occurred when increasing the tolerance from 2° to 5°, because higher order TNCS often has subsets of components that are more closely related than others, and what appears to be complex low-order TNCS at small tolerances reduces to a simple high-order TNCS at larger tolerances. We refer to the coordinates-based test for TNCS as pdbTNCS(r°), where the angle r is the angular tolerance and the value is true/false.

3.2. Patterson vector-length threshold
In the Patterson function, intramolecular vectors cluster around the origin peak. These peaks, which constitute noise in the context of searching for TNCS vectors, can be excluded by setting a minimum vector-length threshold. The shortest TNCS vector that is possible in any given case will depend on the shortest intermolecular spacing, and this distance could be used as a constraint on the TNCS vector. However, the shortest extent is not known before structure solution: only by assuming a spherical molecule could a reasonable estimate of the average molecular extent be made from the molecular weight for a completely unknown structure. Independently, there is a need to exclude short vectors because of pseudosymmetry in secondary-structure elements, such as α-helices and β-sheets. The distances arising from these pseudosymmetries are less than 15 Å, which has been used as the threshold distance for exclusion (Zwart et al., 2005, 2008). We wished to determine whether any TNCS vector in the PDB was shorter than this distance.
The shortest TNCS vector in our database was 22.43 Å for the structure with PDB code 3i57 (MacKenzie et al., 2009), with a fractional translation vector of (0.5, 0, 0) and a rotational tolerance of 6.7°. The structure of PDB entry 3i57 is shown in Fig. 2(a) and its Patterson function in Fig. 2(b). We conclude that the 15 Å distance from the origin of the Patterson peak is suitable for excluding self-vectors while not excluding any true TNCS vectors.
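As a worked illustration of the distance test, the length of a fractional translation vector can be obtained from the unit-cell metric tensor. The sketch below is an assumption-laden example rather than code from the study: the function name `vector_length` is hypothetical, and the b, c and angle values of the example cell are invented. Only a = 44.86 Å follows from the text, since a (0.5, 0, 0) vector of length 22.43 Å implies that cell edge.

```python
import math


def vector_length(frac, cell):
    """Length in Angstrom of a fractional vector for cell (a, b, c, alpha, beta, gamma)."""
    a, b, c, alpha, beta, gamma = cell
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    # Metric tensor G: squared length is x^T G x for fractional coordinates x.
    g = [
        [a * a, a * b * cg, a * c * cb],
        [a * b * cg, b * b, b * c * ca],
        [a * c * cb, b * c * ca, c * c],
    ]
    squared = sum(frac[i] * g[i][j] * frac[j] for i in range(3) for j in range(3))
    return math.sqrt(squared)


# Hypothetical cell: only a = 44.86 A is implied by the text; the rest is invented.
example_cell = (44.86, 60.0, 70.0, 90.0, 90.0, 90.0)
length = vector_length((0.5, 0.0, 0.0), example_cell)
passes_distance_test = length > 15.0  # 15 A exclusion radius around the origin
```

This sketch does not search lattice-equivalent vectors for the shortest representative, which a real peak search would need to consider.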
3.3. Patterson peak threshold
Our next step was to investigate the correlation of pdbTNCS(r°) with the peak heights in the Patterson function. Fig. 3 shows the histograms for the distribution of top non-origin peak heights. Results are shown for Patterson functions calculated with data between 5 and 10 Å resolution and with different pdbTNCS(r°) angular tolerances. Other resolution ranges are shown in Supplementary Fig. S1. The top non-origin peak was expressed as a percentage of the height of the origin peak and as a Z-score value (the number of standard deviations above the mean value). For pdbTNCS(2°), the histogram showed that the traditional Patterson threshold of 20% of the origin peak was broadly correct; this gave an accuracy (defined below) of 96%. However, for pdbTNCS(15°) the accuracy began to break down (94%), and by pdbTNCS(20°) it was only 92%.
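The two ways of expressing the top non-origin peak can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `top_peak_scores` and the inputs are hypothetical, and it assumes that the mean and standard deviation for the Z-score are taken over all Patterson map values, which the text does not specify.

```python
import statistics


def top_peak_scores(map_values, origin_height, peaks, min_distance=15.0):
    """Return (percent of origin, Z-score) for the top peak beyond min_distance.

    peaks is a list of (height, distance_from_origin_in_A) tuples.
    Assumption: Z-score statistics are computed over all map values.
    """
    candidates = [height for height, distance in peaks if distance > min_distance]
    if not candidates:
        return None
    top = max(candidates)
    percent = 100.0 * top / origin_height
    mean = statistics.mean(map_values)
    sd = statistics.pstdev(map_values)
    z_score = (top - mean) / sd
    return percent, z_score


# Toy numbers only; the 50.0 peak is ignored because it lies 5 A from the origin.
toy_map = [0.0, 2.0, 4.0, 6.0, 8.0]
toy_peaks = [(50.0, 5.0), (20.0, 30.0), (12.0, 40.0)]
percent, z_score = top_peak_scores(toy_map, origin_height=100.0, peaks=toy_peaks)
```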
3.4. Decision tree
We used a decision tree (Breiman et al., 1984), which is a predictive modelling approach used in statistics, data mining and machine learning, to develop criteria for distinguishing between the presence and absence of TNCS (Fig. 4). The database was divided randomly into a training set (75%) and a test set (25%). The Gini index (equation 1) was used as a criterion for calculating discrimination. The Gini index is a measure of statistical dispersion defined as twice the area between the receiver operating characteristic (ROC) curve and its diagonal:
G = 2 × AUC − 1, (1)
where AUC is the area under the ROC curve.
The training set was used to train the algorithm, and included information on pdbTNCS(r°) and the highest non-origin peaks. The algorithm resulting from the decision tree was then applied to the test set, which only had the information for the highest non-origin peak. Since there was only one parameter to fit for each decision tree (the height of the peak) we did not need cross-validation to avoid overfitting. A confusion matrix was generated in order to compute the accuracy (ACC), sensitivity (SN), false-positive rate (FPR) and precision (PREC) of the algorithm, where, given that TP are true positives, TN are true negatives, FP are false positives and FN are false negatives,
ACC = (TP + TN)/(TP + TN + FP + FN), (2)
SN = TP/(TP + FN), (3)
PREC = TP/(TP + FP), (4)
FPR = FP/(FP + TN). (5)
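A single-threshold classifier and its confusion-matrix statistics can be sketched in a few lines. The example below uses the standard definitions of accuracy, sensitivity, false-positive rate and precision and applies a fixed peak-height threshold; it is a simplified stand-in, not the scikit-learn decision tree used in the study, and the function names and toy data are invented for illustration.

```python
def confusion_counts(peak_percent, has_tncs, threshold):
    """Count TP, TN, FP, FN when predicting TNCS for peaks at or above threshold."""
    tp = tn = fp = fn = 0
    for percent, truth in zip(peak_percent, has_tncs):
        predicted = percent >= threshold
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Standard confusion-matrix statistics (no guard against empty classes)."""
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "SN": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "PREC": tp / (tp + fp),
    }


# Toy data: top non-origin peak heights (% of origin) and coordinate-based labels.
toy_percent = [5.0, 8.0, 12.0, 18.0, 25.0, 40.0, 60.0, 10.0, 30.0]
toy_labels = [False, False, False, True, True, True, True, True, False]
counts = confusion_counts(toy_percent, toy_labels, threshold=16.8)
scores = metrics(*counts)
```

With only one feature, moving the threshold trades false negatives against false positives directly, which is why a single cut-off is enough to describe the classifier.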
The Patterson resolution ranges explored were 3–10, 4–10, 5–10, 3–15, 4–15 and 5–15 Å. Following our study of the length of TNCS vectors, only peaks further than 15 Å from the origin peak were accepted.
Tables 3 and 4 show that whatever the resolution or pdbTNCS(r°) rotational tolerance, suitable thresholds based on either percentage of the origin peak or Z-scores could be found for high-accuracy decision making; we call the associated threshold t values the Pattersont% and PattersonZt, respectively. Smaller rotational tolerances favoured the use of higher resolution data. Except for five cases highlighted in Table 4, the PattersonZt gave slightly higher accuracies than the Pattersont%.


Taking pdbTNCS(10°) as a useful measure of TNCS, the best predictions, which had 97.6% accuracy (equation 2), used Patterson functions calculated between 5 and 15 Å resolution and a PattersonZt threshold where t = 11.36. Only slightly poorer accuracy, 96.5%, could be obtained using the traditional 5–10 Å resolution range and a Pattersont% threshold, but this required t = 16.8% rather than t = 20%, implying that the previous Pattersont% threshold for TNCS was too conservative. Since altering the resolution range and using a PattersonZt threshold had only a marginal effect on accuracy, we decided to use the traditional 5–10 Å resolution range and Pattersont% threshold for our algorithm, although with a lowered threshold value. Using the narrower resolution range also guards against any technical problems when collecting the low-resolution data.
3.5. False positives and false negatives
The false positives and false negatives were further investigated. The sensitivity (equation 3) of the algorithm was 85% and the precision (equation 4) was 88%, while the false-positive rate (equation 5) was 1%, indicating that the algorithm identifies cases of no TNCS exceptionally well, but fails to identify some cases with TNCS. With only one parameter to fit, there is a simple trade-off between identifying false negatives and false positives. The bias in the classifier towards no TNCS comes about because the database contains a higher proportion of structures without TNCS. If we assume that novel data sets will be no more biased towards having TNCS than deposited structures, then the bias is appropriate for accuracy. It is possible that the proportion of crystals that grow with TNCS is higher than that represented by the database because these structures are less likely to be solved; however, we cannot quantify this.
Both false positives and false negatives will impact structure solution by molecular replacement or experimental phasing.
False positives occurred where the top peak in the Patterson function was above the threshold but pdbTNCS(r°) was false. False positives are particularly severe in the context of structure solution because TNCS will be forced to apply to the components in the asymmetric unit (whether molecular-replacement models or heavy atoms) when there is none. Therefore, the false-positive rate (equation 5) of 1% was significant for practical applications even though low.
False negatives occurred where the Patterson peak was below the threshold proposed by the decision tree but where pdbTNCS(r°) was true. False negatives will mean that intensity modulations are not corrected, and in order to succeed structure solution by molecular replacement will then require high-quality models or, for SAD phasing, the anomalous signal will need to be strong.
Some of the false negatives in the pdbTNCS(10°) confusion matrix could be rescued by considering a larger angular tolerance. Indeed, 353 of 869 of the false negatives are true according to pdbTNCS(20°). Note that this is not equivalent to using the decision tree generated with pdbTNCS(20°), which includes additional false negatives. This phenomenon was true for every pdbTNCS(r°) that we analysed; false negatives could be rescued by considering larger perturbation rotation angles.
3.6. TNCS in reciprocal space
The studies in real space showed that using a Patterson peak threshold gave high accuracy for detecting TNCS when using pdbTNCS(r°) as the definition of TNCS. However, the optimal peak threshold depended critically on the rotation r used for the classification, with the peak threshold becoming lower as r increased. Furthermore, an increasing number of structures that did not have pdbTNCS(r°) were detected as having TNCS as the peak threshold was lowered. The studies using the real-space classifier clearly demonstrated the problem of TNCS being a continuum between exact TNCS and NCS. The problem of false negatives lay not in the threshold, but in the real-space classifier of pdbTNCS(r°).
There are several reasons why pdbTNCS(r°) may not correspond to significant modulations in the data. If the TNCS-related components are large, the radius of the molecular G-function (Rossmann & Blow, 1962) is small so that the modulations fall off faster with orientational differences (Read et al., 2013). If the TNCS-related copies differ substantially in conformation, the modulations fall off faster with resolution. Finally, if the symmetry-related TNCS vectors are very different, modulations arising from the symmetry-related copies will tend to cancel.
The scope of this study is to determine initial parameters for the model of TNCS so that the refinement of TNCS intensity correction factors can proceed. Therefore, if the resulting modulations are not significant then TNCS is effectively not present for our purposes: if the (insignificant) TNCS epsilon factors are omitted there will be no impact on structure solution.
3.7. Epsilon-factor distribution
We examined the distribution of epsilon factors after refinement as an alternative classifier for the presence or absence of TNCS. Refined epsilon factors that cluster around 1 define unmodulated data, while those that refine to the extremes of the distribution define high modulation. We use the variance about 1 (σ_{1}^{2}) as the statistic for measuring the degree of modulation,
σ_{1}^{2} = (1/N) Σ_{i=1}^{N} (ε_{i} − 1)^{2}, (6)
where ε_{i} is the refined epsilon factor of reflection i and N is the number of reflections.
We call this epsTNCS, and it takes a range of values between 0 and (n/2)^{2} + [(n/2) − 1]^{2}, although in practice it is less than 1 in all but extraordinary circumstances. Histograms showing examples of the distribution of epsilon factors and their associated epsTNCS are presented in Fig. 5.
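The variance-about-1 statistic can be illustrated with an idealized calculation. The sketch below assumes the simplest possible model, which is not the refined Phasertng model: two identical copies related by an exact translation t (no TNCS rotation, zero TNCS r.m.s.d.), for which the TNCS epsilon factor of reflection h reduces to 1 + cos(2πh·t). The function names and the small reflection list are hypothetical.

```python
import math


def ideal_epsilon(hkl, tncs_vector):
    """Epsilon factor for two identical copies related by an exact translation.

    Idealized assumption: no rotational difference and zero r.m.s.d., so the
    normalized intensity modulation is 1 + cos(2*pi*h.t).
    """
    phase = 2.0 * math.pi * sum(h * t for h, t in zip(hkl, tncs_vector))
    return 1.0 + math.cos(phase)


def eps_tncs(epsilons):
    """Variance of the epsilon factors about 1 (the epsTNCS statistic)."""
    return sum((e - 1.0) ** 2 for e in epsilons) / len(epsilons)


# A small hypothetical set of reflections.
reflections = [(h, k, l) for h in range(-4, 5) for k in range(3) for l in range(3)]
eps = [ideal_epsilon(hkl, (0.5, 0.0, 0.0)) for hkl in reflections]
modulated = eps_tncs(eps)
unmodulated = eps_tncs([1.0, 1.0, 1.0])
```

For the exact (0.5, 0, 0) translation every epsilon factor is either 0 or 2, so the statistic reaches 1, the upper limit quoted in the text for n = 2; unmodulated data give 0.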
The distribution of epsTNCS values versus Pattersont% is shown in Fig. 6. There is a clear linear relationship between the two: Patterson peak height is directly related to modulation in the data. The PattersonZt had a lower correlation coefficient (0.82) with the epsTNCS than the Pattersont%. The correlation coefficient between the epsTNCS and the Pattersont% was 0.934 and was calculated with epsTNCS refined against 5–10 Å resolution data and Patterson functions calculated with 5–10 Å resolution data.
This analysis demonstrated that the false negatives in the algorithm, as determined by pdbTNCS(r°) (a binary measure), were cases in which the epsTNCS (a real number) was low and therefore their misclassification should not strongly impact structure solution. It also demonstrates that the peak height is a good measure for the ranking of a TNCS hypothesis.
3.8. Completeness
It has long been known that complete, good-quality data are necessary for successful molecular replacement using Patterson methods (Navaza, 1994). In the course of our study, we noted that the completeness of the data had a significant effect on the accuracy of our Patterson function-based decision tree. Eight cases [PDB entries 3c6o (Hayashi et al., 2008), 1jpn (Padmanabhan & Freymann, 2001), 1sxh (Schumacher et al., 2004), 1n8o (C. Cambillau, S. Spinelli & M. Lauwereys, unpublished work), 1eam (Hu et al., 1999), 1wwr (Kuratani et al., 2005), 3it5 (Spencer et al., 2010) and 1lbs (Uppenberg et al., 1995)] had high Patterson peaks but no significant epsilon-factor dispersion. There was one outlier (PDB entry 3he1; Osipiuk et al., 2011) with a variance about 1 (equation 6) of nearly 1.6 for TNCS_{6}, the only case we observed for which σ_{1}^{2} was greater than one (Supplementary Fig. S2). This figure shows that low-completeness data resulted in several other outliers in the Pattersont% versus σ_{1}^{2} scatter plot. The accuracy of the decision tree deteriorated with decreasing completeness (Supplementary Fig. S3). We have not investigated the distribution of missing data in these data sets; however, when large percentages of data are missing it is normally because the user has failed to collect a wedge of data, either through initial misidentification of the true space group, radiation damage causing data quality to drop so that later parts of a data collection must be excluded, or a high number of overlapped reflections in a section of the data (for example due to one long unit-cell dimension). Lacking a wedge of data will impact the epsTNCS because systematic omission of data for a direction in reciprocal space leaves parameters in real space perpendicular to that direction undefined. In addition, missing wedges of data complicate data processing, and if due to overlaps some reflections may be integrated including partial intensity from a neighbouring reflection; any such rogue high-intensity reflections cause strong modulation of the Patterson function.
3.9. Lattice-translocation disorder
For the cases of false positives, Patterson functions were calculated from the coordinates and compared with the observed Patterson functions. In all cases, the highest non-origin Patterson peak from the calculated data was below the 20% threshold. It is possible that these structures show a degree of lattice-translocation disorder, with stacking heterogeneity between mosaic blocks (Rye et al., 2007; Dauter et al., 2005). Interestingly, the distribution of space groups in these structures differed significantly from the distribution across all deposited structures, with P2_{1} present at three times the expected number (see Table 5). The 2_{1} screw has been implicated as an important component of polytropism for crystals (Aquilano et al., 2003).

4. TNCS detection
Our algorithm for TNCS detection not only determines the TNCS vector and the TNCS order, but also has tests that aim to exclude pathological cases. Firstly, a Patterson function is calculated from the data, by default using 5–10 Å resolution data. Peaks are picked in the Patterson function and filtered using two criteria: the peak height must be above a given percentage of the origin-peak height and the peak distance must be more than a given distance from the origin. As guided by this study, the default distance threshold is 15 Å and the default Patterson% threshold is 16.8%. Cases in which at least one of the unit-cell dimensions is less than the origin distance threshold are considered to be pathological (most likely peptides) and are excluded from further analysis. If there are no surviving non-origin distinct peaks over the Patterson% threshold, the algorithm terminates with status `TNCS not indicated', otherwise the algorithm proceeds to analysis of the TNCS order. The simplest interpretation of surviving peaks is that each (if there are more than one) presents an independent TNCS_{2} vector, with Patterson% indicating the strength of the associated modulation, which provides a ranking for the hypotheses.
We then perform further analysis to determine whether the Patterson peaks are due to a higher order TNCS commensurate modulation and, if so, the order of that commensurate modulation. Noise in the Patterson function is removed by setting all values below 8% of the origin peak to zero, and the noise-reduced Patterson function is transformed to reciprocal space, where commensurate modulation is detected as strong low-order Fourier terms. The hypothesis for a given commensurate modulation will predict a set of equal-height peaks in the Patterson function. In practice, because the components are not related by a perfect translation (as previously discussed) these predicted peaks will have different heights and some may be below the Pattersont% threshold of the analysis. Following our studies on epsTNCS and the high correlation with the height of the highest Patterson peak, we rank commensurate modulations that predict the highest ranked peak higher than those that do not.
The result of the algorithm is a ranked list of TNCS modulations representing high-order commensurate TNCS_{n} and commensurate and non-commensurate TNCS_{2}. Following our observation that high Patterson peaks in the data may be due to order–disorder effects, the case of no TNCS is also always included in the list of hypotheses. Note that the ranking is not necessary for structure solution. In the context of an automated pipeline, as long as the correct hypothesis is in the list, it will be explored. The ranking only affects the order in which the hypotheses are explored, and hence the efficiency of structure solution.
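The first stage of the procedure, peak filtering followed by ranking, can be sketched as below. This is an illustration under stated assumptions and not the Phasertng implementation: the function name, the peak tuple layout and the returned dictionary keys are all hypothetical, and the analysis of higher order commensurate modulation is omitted. The default thresholds of 15 Å and 16.8% are those quoted in the text.

```python
def tncs_hypotheses(peaks, origin_height, cell_lengths,
                    min_distance=15.0, min_percent=16.8):
    """Filter Patterson peaks and return (status, ranked hypotheses).

    peaks: list of (fractional_vector, height, distance_in_A) tuples
    (hypothetical layout). Only order-2 hypotheses are generated here.
    """
    no_tncs = {"order": 1, "vector": None, "patterson_percent": 0.0}
    if min(cell_lengths) < min_distance:
        # A cell edge shorter than the exclusion distance is treated as pathological.
        return "excluded (pathological cell)", []

    surviving = []
    for vector, height, distance in peaks:
        percent = 100.0 * height / origin_height
        if distance > min_distance and percent > min_percent:
            surviving.append({"order": 2, "vector": vector,
                              "patterson_percent": percent})
    if not surviving:
        return "TNCS not indicated", [no_tncs]

    # Rank by peak height; the no-TNCS case is always kept in the list.
    surviving.sort(key=lambda hyp: hyp["patterson_percent"], reverse=True)
    return "TNCS indicated", surviving + [no_tncs]


# Toy peaks: one is too close to the origin, one is too weak.
toy_peaks = [
    ((0.5, 0.0, 0.0), 45.0, 30.0),
    ((0.1, 0.0, 0.0), 60.0, 8.0),
    ((0.0, 0.5, 0.0), 20.0, 25.0),
    ((0.3, 0.3, 0.0), 10.0, 28.0),
]
status, ranked = tncs_hypotheses(toy_peaks, origin_height=100.0,
                                 cell_lengths=(60.0, 50.0, 70.0))
```

Keeping the no-TNCS entry at the end of the list mirrors the point made in the text that a large peak does not guarantee TNCS, so that alternative remains available to a pipeline.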
An unoptimized part of the algorithm attempts to prevent the misclassification of coiled coils and amyloid peptide repeats as having TNCS. As previously discussed, pseudosymmetry in secondary-structure elements generates large peaks in the Patterson function close to the origin. Although coiled coils were excluded from our curated database, by looking at a small number of cases it was observed that the 15 Å minimum vector exclusion around the origin was not sufficient to exclude peaks generated by the coiled-coil pseudosymmetry (Kondo et al., 2008). Taking a heuristic approach, we exclude peaks from the TNCS analysis if they cluster together with the short distance separation characteristic of coiled coils. Future work will perform a systematic study of coiled coils and amyloid peptide repeats to optimize the TNCS-detection algorithm in these cases. Note that it is the clustering of a number of peaks corresponding to the helical repeat distance that is characteristic of coiled coils, rather than the presence of a peak close to the origin per se.
5. Discussion
We have developed an algorithm for characterizing and ranking TNCS hypotheses by analysis of the intensities prior to structure solution. Correct identification of TNCS can have a profound impact on the ability to place components in the asymmetric unit, whether they be components by molecular replacement or heavy atoms by experimental phasing. In the context of a pipeline for structure solution, the fastest route to structure solution on average should be to explore the TNCS hypotheses in order of ranking by our criteria. Future work will develop our automation strategies to make optimal use of this information and will include dynamic reranking of TNCS hypotheses.
Unexpectedly, several entries in our database had significant Patterson peaks despite not having TNCS. One of these cases was the proteolytic domain of Archaeoglobus fulgidus Lon protease (Dauter et al., 2005; PDB entry 1z0v), a structure known to be an order–disorder twin (see also PDB entry 1z0t; Lebedev, 2009). Individual crystals belonged to space groups P2_{1} and P2_{1}2_{1}2_{1}, with the transition layers in plane P2_{1}2_{1}(2) giving a sequence of stacking vectors. Another case was lipase B from Candida antarctica, which is also known to be an order–disorder twin. In this case, the two space groups involved were C2 and P2_{1}2_{1}2_{1}, with the transition layers again in plane P2_{1}2_{1}(2). The deposited data for PDB entry 1lbs (Uppenberg et al., 1995) were processed in the larger, orthorhombic lattice, which resulted in an apparent data completeness of 27.5%, although the actual completeness in C2 was 82.4%. In terms of our study, this structure was included in the small database of structures with less than 80% complete data; however, had it been included in the main database it would have been the most extreme false-positive outlier. In another case, the FtsK motor domain from Escherichia coli (Massey et al., 2006; PDB entry 2ius), the indexing and space-group determination for the crystal were problematic (Jan Löwe, personal communication). We thus hypothesize that these outliers result from structures with a lattice-translocation defect rather than TNCS. In an automated setting it is therefore important to consider the absence of TNCS even when large Patterson peaks are present.
In the course of our study, we also noted a few cases in which subgroups of components were related by different TNCS vectors. These cases tended towards pseudocentring in multiple directions. For example, a small ligand-bound complex of von Hippel–Lindau (VHL) E3 ubiquitin ligase and the hypoxia-inducible factor (HIF) alpha subunit (Galdeano et al., 2014; PDB entry 4w9d; P4_{1}22) showed pseudocentring in the a (0.5, 0.04, 0.0) and ab diagonal (0.54, 0.5, 0.0) directions, and similarly the structure of the SOAR domain (Yang et al., 2012; PDB entry 3teq; P4_{1}2_{1}2) showed pseudocentring in the a (0.49, 0.01, 0.0) and ab diagonal (0.49, 0.51, 0.0) directions. If there are subgroups of components related by different TNCS vectors, or if only some components of the asymmetric unit are related by a TNCS vector, then the modulations of the expected intensities due to the TNCS will be much less significant and structure solution may be achieved without any TNCS correction being applied, as indeed was the case in these examples. However, if structure solution fails, detecting and correcting for the dominant TNCS order may be enough.
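The weakening of the TNCS signal when only some components are related by a translation can be illustrated with a one-dimensional toy model. The sketch below is not the algorithm described in this work. It assumes a periodic 1D "cell" sampled on a grid, random positive motifs standing in for molecules, and a Patterson function computed directly as the circular autocorrelation of the density (which is equivalent to the Fourier transform of the intensities). The names `patterson_1d` and `relative_peak`, and the grid size, motif width and offsets, are all hypothetical choices made for illustration.

```python
import random


def patterson_1d(density):
    """Toy Patterson function: circular autocorrelation of a periodic 1D density."""
    n = len(density)
    return [
        sum(density[x] * density[(x + u) % n] for x in range(n))
        for u in range(n)
    ]


def relative_peak(density, shift):
    """Height of the Patterson value at `shift` as a fraction of the origin peak."""
    p = patterson_1d(density)
    return p[shift] / p[0]


random.seed(0)
n, width, shift = 256, 16, 100          # hypothetical grid size, motif width, TNCS offset
motif = [random.random() for _ in range(width)]   # component that is copied by TNCS
other = [random.random() for _ in range(width)]   # unrelated component

# Case 1: every component is related by the translation (two identical copies).
full = [0.0] * n
for i, value in enumerate(motif):
    full[i] += value
    full[shift + i] += value

# Case 2: the same pair plus a third, unrelated component placed at grid point 40,
# chosen so that its cross-vectors with the other two do not overlap the TNCS peak.
partial = list(full)
for i, value in enumerate(other):
    partial[40 + i] += value

ratio_full = relative_peak(full, shift)
ratio_partial = relative_peak(partial, shift)

print(f"all components related by TNCS:  {ratio_full:.3f}")
print(f"only some components related:    {ratio_partial:.3f}")
```

In this toy setting the two identical copies give a peak at about half the origin height, and adding the unrelated component lowers that fraction because it contributes to the origin peak but not to the translation peak. Real Patterson maps are three-dimensional and are affected by resolution limits, experimental error and symmetry, so the numbers here are only qualitative.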
In this work, we have not attempted to model either the TNCS rotation or the TNCS r.m.s.d. from the Patterson function. Some information about these parameters is contained in the peak height relative to the origin peak, with lower peak heights indicating greater deviation from perfect translation. There may also be information about rotational deviations in the three-dimensional peak shape. However, in practice, refinement of these parameters from several different TNCS rotation perturbations works extremely well, and in most cases all perturbations converge to the same final TNCS rotation and TNCS r.m.s.d.
Future improvements to the method could come from improvements in the coefficients used to calculate the Patterson function. Down-weighting coefficients with high experimental error may mitigate the differences seen between Patterson functions calculated with different resolution ranges. Work is in progress to optimize the information in Patterson-like functions in this, and other, crystallographic contexts.
Supporting information
Supplementary Figures. DOI: https://doi.org/10.1107/S2059798320016836/gm5078sup1.pdf
Funding information
IU acknowledges support from the Spanish Ministry of Economy and Competitiveness by grants BIO201564216P, PGC2018101370B100 and MDM2014043501 and from Generalitat de Catalunya by grant 2017SGR1192. IC acknowledges support from the Spanish Ministry of Economy and Competitiveness by grant BES2016076329. PVA acknowledges support from the US Department of Energy under Contract No. DE-AC02-05CH11231 and the PHENIX Industrial Consortium. RJR acknowledges support from Wellcome Trust Principal Research Fellowship grant 209407/Z/17/Z and National Institutes of Health grant P01GM063210. MDS gratefully acknowledges fellowship support from the European Union's Horizon 2020 research and innovation program under the Marie Skłodowska-Curie grant (number 790122).
References
Albertini, A. A. V., Wernimont, A. K., Muziol, T., Ravelli, R. B. G., Clapier, C. R., Schoehn, G., Weissenhorn, W. & Ruigrok, R. W. H. (2006). Science, 313, 360–363.
Aquilano, D., Pastero, L., Veesler, S. & Astier, J. P. (2003). Crystal Growth: From Basic to Applied, edited by S. Carrà & C. Paorici, pp. 47–64. Rome: Accademia Nazionale dei Lincei.
Breiman, L., Friedman, J., Olshen, R. & Stone, C. (1984). Classification and Regression Trees. New York: Chapman & Hall.
Burley, S. K., Berman, H. M., Bhikadiya, C., Bi, C., Chen, L., Di Costanzo, L., Christie, C., Duarte, J. M., Dutta, S., Feng, Z., Ghosh, S., Goodsell, D. S., Green, R. K., Guranovic, V., Guzenko, D., Hudson, B. P., Liang, Y., Lowe, R., Peisach, E., Periskova, I., Randle, C., Rose, A., Sekharan, M., Shao, C., Tao, Y.-P., Valasatava, Y., Voigt, M., Westbrook, J., Young, J., Zardecki, C., Zhuravleva, M., Kurisu, G., Nakamura, H., Kengaku, Y., Cho, H., Sato, J., Kim, J. Y., Ikegawa, Y., Nakagawa, A., Yamashita, R., Kudou, T., Bekker, G.-J., Suzuki, H., Iwata, T., Yokochi, M., Kobayashi, N., Fujiwara, T., Velankar, S., Kleywegt, G. J., Anyango, S., Armstrong, D. R., Berrisford, J. M., Conroy, M. J., Dana, J. M., Deshpande, M., Gane, P., Gáborová, R., Gupta, D., Gutmanas, A., Koča, J., Mak, L., Mir, S., Mukhopadhyay, A., Nadzirin, N., Nair, S., Patwardhan, A., Paysan-Lafosse, T., Pravda, L., Salih, O., Sehnal, D., Varadi, M., Vařeková, R., Markley, J. L., Hoch, J. C., Romero, P. R., Baskaran, K., Maziuk, D., Ulrich, E. L., Wedell, J. R., Yao, H., Livny, M. & Ioannidis, Y. E. (2019). Nucleic Acids Res. 47, D520–D528.
Chook, Y. M., Lipscomb, W. N. & Ke, H. (1998). Acta Cryst. D54, 822–827.
Dauter, Z., Botos, I., LaRonde-LeBlanc, N. & Wlodawer, A. (2005). Acta Cryst. D61, 967–975.
French, S. & Wilson, K. (1978). Acta Cryst. A34, 517–525.
Galdeano, C., Gadd, M. S., Soares, P., Scaffidi, S., Van Molle, I., Birced, I., Hewitt, S., Dias, D. M. & Ciulli, A. (2014). J. Med. Chem. 57, 8657–8663.
Grosse-Kunstleve, R. W., Sauter, N. K., Moriarty, N. W. & Adams, P. D. (2002). J. Appl. Cryst. 35, 126–136.
Hayashi, K.-I., Tan, X., Zheng, N., Hatate, T., Kimura, Y., Kepinski, S. & Nozaki, H. (2008). Proc. Natl Acad. Sci. USA, 105, 5632–5637.
Hu, G., Gershon, P. D., Hodel, A. E. & Quiocho, F. A. (1999). Proc. Natl Acad. Sci. USA, 96, 7149–7154.
Hunter, J. D. (2007). Comput. Sci. Eng. 9, 90–95.
Jamshidiha, M., Pérez-Dorado, I., Murray, J. W., Tate, E. W., Cota, E. & Read, R. J. (2019). Acta Cryst. D75, 342–353.
Joosten, R. P., Joosten, K., Murshudov, G. N. & Perrakis, A. (2012). Acta Cryst. D68, 484–496.
Kondo, J., Urzhumtseva, L. & Urzhumtsev, A. (2008). Acta Cryst. D64, 1078–1091.
Kuratani, M., Ishii, R., Bessho, Y., Fukunaga, R., Sengoku, T., Shirouzu, M., Sekine, S. I. & Yokoyama, S. (2005). J. Biol. Chem. 280, 16002–16008.
Lebedev, A. A. (2009). PhD thesis. University of York, UK.
Liebschner, D., Afonine, P. V., Baker, M. L., Bunkóczi, G., Chen, V. B., Croll, T. I., Hintze, B., Hung, L.-W., Jain, S., McCoy, A. J., Moriarty, N. W., Oeffner, R. D., Poon, B. K., Prisant, M. G., Read, R. J., Richardson, J. S., Richardson, D. C., Sammito, M. D., Sobolev, O. V., Stockwell, D. H., Terwilliger, T. C., Urzhumtsev, A. G., Videau, L. L., Williams, C. J. & Adams, P. D. (2019). Acta Cryst. D75, 861–877.
MacKenzie, D. A., Tailford, L. E., Hemmings, A. M. & Juge, N. (2009). J. Biol. Chem. 284, 32444–32453.
Massey, T. H., Mercogliano, C. P., Yates, J., Sherratt, D. J. & Löwe, J. (2006). Mol. Cell, 23, 457–469.
McCoy, A. J. (2007). Acta Cryst. D63, 32–41.
McCoy, A. J., Stockwell, D. H., Sammito, M. D., Oeffner, R. D., Hatti, K. S., Croll, T. I. & Read, R. J. (2021). Acta Cryst. D77, 1–10.
Medina, A., Triviño, J., Borges, R. J., Millán, C., Usón, I. & Sammito, M. D. (2020). Acta Cryst. D76, 193–208.
Navaza, J. (1994). Acta Cryst. A50, 157–163.
Osipiuk, J., Xu, X., Cui, H., Savchenko, A., Edwards, A. & Joachimiak, A. (2011). J. Struct. Funct. Genomics, 12, 21–26.
Padmanabhan, S. & Freymann, D. M. (2001). Structure, 9, 859–867.
Patterson, A. L. (1935). Z. Kristallogr. 90, 517–542.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M. & Duchesnay, É. (2011). J. Mach. Learn. Res. 12, 2825–2830.
Read, R. J., Adams, P. D. & McCoy, A. J. (2013). Acta Cryst. D69, 176–183.
Read, R. J. & McCoy, A. J. (2016). Acta Cryst. D72, 375–387.
Rossmann, M. G. & Blow, D. M. (1962). Acta Cryst. 15, 24–31.
Rye, C. A., Isupov, M. N., Lebedev, A. A. & Littlechild, J. A. (2007). Acta Cryst. D63, 926–930.
Schumacher, M. A., Allen, G. S., Diel, M., Seidel, G., Hillen, W. & Brennan, R. G. (2004). Cell, 118, 731–741.
Sliwiak, J., Jaskolski, M., Dauter, Z., McCoy, A. J. & Read, R. J. (2014). Acta Cryst. D70, 471–480.
Spencer, J., Murphy, L. M., Conners, R., Sessions, R. B. & Gamblin, S. J. (2010). J. Mol. Biol. 396, 908–923.
Stewart, J. M. & Karle, J. (1976). Acta Cryst. A32, 1005–1007.
Taylor, E. J., Gloster, T. M., Turkenburg, J. P., Vincent, F., Brzozowski, A. M., Dupont, C., Shareck, F., Centeno, M. S. J., Prates, J. A. M., Puchart, V., Ferreira, L. M. A., Fontes, C. M. G. A., Biely, P. & Davies, G. J. (2006). J. Biol. Chem. 281, 10968–10975.
Uppenberg, J., Ohrner, N., Norin, M., Hult, K., Kleywegt, G. J., Patkar, S., Waagen, V., Anthonsen, T. & Jones, T. A. (1995). Biochemistry, 34, 16838–16851.
Wilson, A. J. C. (1949). Acta Cryst. 2, 318–321.
Wukovitz, S. W. & Yeates, T. O. (1995). Nat. Struct. Biol. 2, 1062–1067.
Yang, X., Jin, H., Cai, X., Li, S. & Shen, Y. (2012). Proc. Natl Acad. Sci. USA, 109, 5657–5662.
Zwart, P. H., Grosse-Kunstleve, R. W. & Adams, P. D. (2005). CCP4 Newsl. Protein Crystallogr. 42, contribution 10.
Zwart, P. H., Grosse-Kunstleve, R. W., Lebedev, A. A., Murshudov, G. N. & Adams, P. D. (2008). Acta Cryst. D64, 99–107.
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.