research papers
Paired refinement under the control of PAIREF
a Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague, Břehová 7, Prague 11519, Czech Republic, b Institute of Biotechnology of the Czech Academy of Sciences, Biocev, Průmyslová 595, Vestec 25250, Czech Republic, and c University of Konstanz, Box M647, Konstanz 78457, Germany
*Correspondence e-mail: petr.kolenko@fjfi.cvut.cz
Crystallographic resolution is a key characteristic of diffraction data and represents one of the first decisions an experimenter has to make in data evaluation. Conservative approaches to the high-resolution cutoff determination are based on a number of criteria applied to the processed X-ray diffraction data only. However, high-resolution data that are weaker than arbitrary cutoffs can still result in the improvement of electron-density maps and refined structure models. Therefore, the impact of reflections from resolution shells higher than those previously used in conservative structure refinement should be analysed by the paired refinement protocol. For this purpose, a tool called PAIREF was developed to provide automation of this protocol. As a new feature, a complete cross-validation procedure has also been implemented. Here, the design, usage and control of the program are described, and its application is demonstrated on six data sets. The results show that the inclusion of high-resolution data beyond the conventional criteria can lead to more accurate structure models.
Keywords: macromolecular crystallography; PAIREF; X-ray diffraction; paired refinement; high-resolution limit.
1. Introduction
Crystallographic resolution is understood as the minimum plane spacing given by Bragg's law for a particular set of X-ray diffraction intensities that are included in the structure analysis (Online Dictionary of Crystallography, https://dictionary.iucr.org/Resolution). In contrast, optical resolution is defined as the expected minimum distance between two resolved peaks in the electron-density map (Vaguine et al., 1999). The resolution of data is limited because the intensity-to-noise ratio of reflections decreases with increasing resolution. The weakness of the high-resolution data is caused by several factors, including the Lorentz-polarization factor, temperature factor and crystal imperfection. Therefore, the diffraction data are usually cut off at a certain resolution, with the aim of rejecting the data that do not improve the model.
In previous decades, conservative criteria were applied to estimate the resolution of crystallographic data. These criteria were based on a user-defined value of data quality indicators such as the signal-to-noise ratio 〈I/σ(I)〉, the disagreement residual of multiple observations Rmerge, etc. (Evans, 2011). Later, the Pearson CC1/2, quantifying the internal consistency of observations, was added to these criteria (Karplus & Diederichs, 2012). Inspection of the data deposited in the PDB (Berman et al., 2000) shows that there is no consensus in the application of these statistics. Moreover, the possibility of improvement of a refined model by employing a different resolution range was often not considered. Nowadays, the application of strict cutoff values on selected data quality indicators has been shown to be an obsolete approach (Diederichs & Karplus, 2013; Evans & Murshudov, 2013). Very recently, it became possible to estimate the information gain from each reflection using likelihood-based methods (Read et al., 2020). Yet this approach does not answer the question of which high-resolution cutoff should be used with current programs.
The ambiguity in the high-resolution-cutoff estimation has been removed with the advent of the `paired refinement' protocol (Karplus & Diederichs, 2012). Initially, a conservative criterion is applied as usual to the high-resolution data and the structure is solved. Usually, the model is then significantly improved by refinement. In the paired refinement protocol, the influence of the previously rejected high-resolution data during the structure refinement is tested. The structure model is refined stepwise against data at higher and higher resolution until no improvement of the model is observed. More specifically, each increase in resolution is checked against the original resolution for its added value, particularly by comparing R values of models against the same data. Only those resolution shells that prove beneficial are included in the final data set, against which the structure is refined.
In this paper, we present a new tool, PAIREF, which helps to make the decision about the useful resolution of the data set. The program performs paired refinement for validation of the high-resolution data in a fully automatic way. PAIREF is not the first utility that implements paired refinement, since a similar function is present in PDB-REDO (Joosten et al., 2014). Nevertheless, PAIREF provides additional features (e.g. complete cross-validation, modification of the structure refinement protocol) and reports that naturally require more extensive input, and allows a user to make a more sophisticated decision.
2. Design and implementation
PAIREF is a command-line tool that can be installed as a module into the CCTBX toolbox (Grosse-Kunstleve et al., 2002) on various platforms (GNU/Linux, MS Windows). Currently, it has been developed in Python 2.7 (Hunter, 2007; Rossum, 1995) but is ready to move to Python 3. It depends on the following programs of the CCP4 software package (Winn et al., 2011): REFMAC5 (Murshudov et al., 2011), SFCHECK (Vaguine et al., 1999), MTZDUMP, SFTOOLS and BAVERAGE; and on the module pdbtools (Adams et al., 2010) from CCTBX. Input parameters can be specified in order to place the protocol under the full control of the user.
A typical command-line example for a PAIREF job is cctbx.python -m pairef --XYZIN starting_model_2-4A.pdb --HKLIN data_2A.mtz --HKLIN_UNMERGED data_2A_unmerged.mtz -i 2.4 -r 2.3,2.2,2.1,2.0, which executes refinements of the structure model starting_model_2-4A.pdb (previously refined at 2.4 Å) for a series of cutoffs (stepwise 2.3, 2.2, 2.1 and 2.0 Å). Specification of unmerged data (MTZ, unmerged Scalepack or XDS/XSCALE file types) is only required if comparison of CCwork with CC* (see below) should be enabled.
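For scripted use, the command above can be assembled programmatically. The following is a minimal sketch in Python 3 (not part of PAIREF itself); the helper name build_pairef_command is hypothetical, and only the options quoted in the text are used.

```python
import shlex


def build_pairef_command(model, merged, start, cutoffs, unmerged=None):
    """Assemble the PAIREF call quoted in the text as an argument list.

    Hypothetical helper: it only uses the options shown in the example
    command (--XYZIN, --HKLIN, --HKLIN_UNMERGED, -i, -r).
    """
    cmd = ["cctbx.python", "-m", "pairef", "--XYZIN", model, "--HKLIN", merged]
    if unmerged is not None:
        # Unmerged data are only needed for the CCwork versus CC* comparison.
        cmd += ["--HKLIN_UNMERGED", unmerged]
    cmd += ["-i", str(start), "-r", ",".join(str(c) for c in cutoffs)]
    return cmd


cmd = build_pairef_command(
    "starting_model_2-4A.pdb",
    "data_2A.mtz",
    2.4,
    [2.3, 2.2, 2.1, 2.0],
    unmerged="data_2A_unmerged.mtz",
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be handed to subprocess.run in an environment where CCTBX and PAIREF are installed.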
2.1. Parameters and algorithm
The algorithm implemented in PAIREF depends on the amount of data provided by the user. The minimal function of the program requires the following input files: a structure model refined at the starting resolution (PDB or mmCIF format) and higher-resolution merged diffraction data in MTZ format which have the same free reflection flags as the data previously used in the refinement (Fig. 1). Nevertheless, the minimal requirement is not sufficient for deep data analysis including statistics such as CC*. The protocol can be further supplemented by the full-resolution unmerged data for calculating merging statistics, by external restraints in the case where non-standard ligands are present and by a command file for REFMAC5 (alternatively generated by PDB-REDO) for better control of the structure refinement. Moreover, a definition of domains for translation–libration–screw (TLS) refinement can be provided by the user. The program allows the selection of resolution shells (with a default width of 0.05 Å) and optional model modifications before the paired refinement.
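The stepwise cutoffs tested by the protocol follow directly from a starting resolution, a high-resolution limit and a shell width. A minimal sketch of this bookkeeping is given below; the function resolution_cutoffs is a hypothetical illustration and not the routine used inside PAIREF.

```python
def resolution_cutoffs(start, limit, width=0.05):
    """List the cutoffs (in Å) tested when stepping from `start` to `limit`.

    Hypothetical helper: `width` defaults to the 0.05 Å shell width
    mentioned in the text. Cutoffs never go beyond `limit`.
    """
    if width <= 0 or limit >= start:
        raise ValueError("need width > 0 and limit < start (higher resolution)")
    # The small epsilon guards against floating-point round-off in the division.
    n_steps = int((start - limit) / width + 1e-9)
    return [round(start - k * width, 3) for k in range(1, n_steps + 1)]


# The series used in the command-line example of the text:
print(resolution_cutoffs(2.4, 2.0, width=0.1))
```

If the interval is not a whole multiple of the width, the last cutoff stops short of the limit rather than overshooting it; this is a design choice of the sketch.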
Our paired refinement protocol with REFMAC5 is an adaptation of the original protocol that has been performed with phenix.refine (Karplus & Diederichs, 2012; Afonine et al., 2012). Initially, the input files are checked using MTZDUMP and CCTBX for consistency. The model, previously refined at the starting resolution A, is then refined against the data up to resolution B (higher than A), and this model is compared with the original one, both against the data at resolution A (see Section 2.2). This step is then repeated from resolution B up to resolution C (higher than B) and reproduced again until the maximum limit is reached. CCwork and CCfree statistics are calculated using SFTOOLS (Karplus & Diederichs, 2012). Finally, merging statistics are calculated using the CCTBX library if unmerged diffraction data were provided.
As an option, PAIREF provides a complete cross-validation protocol (Brünger, 1993; Jiang & Brünger, 1994), also referred to as k-fold cross-validation (Luebben & Gruene, 2015), to investigate the impact of the selection of free reflections. Here, the paired refinement protocol is run in parallel for each selection individually. To remove the bias given by previous refinement with a particular set of free reflections, a number of optional input model modifications prior to refinement have been implemented: the perturbation of the atomic coordinates, the reset of atomic displacement parameters (ADPs) to a particular or average value and the addition of a fixed value to them (achieved by the module pdbtools from CCTBX and by BAVERAGE). In the final report, both the averaged statistics and the individual statistics for each selection are reported. Application of this protocol is demonstrated on a data set from cysteine dioxygenase (Section 3.3). The complete cross-validation requires the CCP4-style test set description in the input MTZ file, i.e. multiple free reflection labels must be present.
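The averaging over free reflection sets in the complete cross-validation can be illustrated with a short sketch. The function name, the data layout and the numerical values below are hypothetical; the use of the sample standard deviation is also an assumption, as the text does not specify the estimator.

```python
from statistics import mean, stdev


def average_over_free_sets(delta_rfree):
    """Average per-shell Rfree differences over all free reflection sets.

    delta_rfree maps a free-set label to a list of Rfree differences, one
    per added resolution shell. Returns a list of (mean, sample standard
    deviation) tuples, one per shell. At least two free sets are required.
    """
    per_shell = zip(*delta_rfree.values())
    return [(mean(values), stdev(values)) for values in per_shell]


# Illustrative, made-up numbers for three free sets and two added shells.
example = {
    0: [-0.002, 0.001],
    1: [-0.004, 0.003],
    2: [-0.003, 0.002],
}
for shell, (avg, sd) in enumerate(average_over_free_sets(example), start=1):
    print("shell %d: mean dRfree = %+.4f, sd = %.4f" % (shell, avg, sd))
```

A negative mean difference indicates that, averaged over the free sets, the added shell lowered Rfree.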
The program PAIREF does not have any decision-making routines and it remains up to the user to decide on the resolution cutoff based on the comprehensive analysis that was performed. Structure refinement is a multiparametric calculation and the user should be aware of potential problems. For example, nonconvergent refinement may result in misleading statistics and a suboptimal model (Tickle, 2011). One of the parameters that may potentially play a role is the FFT grid size (Drenth & Jeroen, 2010).
2.2. Program output and interpretation of results
Paired refinement does not reduce the problem of high-resolution cutoff estimation to a single monitoring statistic. Rather, a comprehensive data analysis is summarized on an HTML page. Here, various plots and tables are presented, and many intermediate files and log files are accessible via hyperlinks.
The first monitoring statistics reported by PAIREF are the differences in R values between the models refined at adjacent resolutions (both computed at the lower resolution to provide a valid comparison). A decrease in Rfree is expected in shells beneficial to the model quality. However, a constant Rfree and a simultaneous increase in Rwork are usually acceptable as well because these indicate less overfitting of the structure model (Karplus & Diederichs, 2012). Therefore, the next monitoring statistic is Rgap (Rgap = Rfree − Rwork), which is calculated at the starting resolution (corresponding to resolution A in Section 2.1) for all analyzed shells. This is an implementation of a previously published protocol (Winter et al., 2018). In the case of the complete cross-validation protocol, R values for each set of free reflections and average values are reported. Moreover, the standard deviations of R values of structure models refined using different free reflection sets are calculated (Kleywegt & Brünger, 1996).
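The R-value monitoring statistics described above reduce to simple arithmetic, sketched below. The function names, the tolerance and the example numbers are hypothetical; in particular, shell_looks_beneficial only mirrors the two acceptable patterns named in the text and is not a decision routine of PAIREF, which leaves the decision to the user.

```python
def r_differences(previous, current):
    """Differences in (Rwork, Rfree) between two models on the SAME data.

    previous, current: (Rwork, Rfree) of the models refined at the lower
    and at the higher cutoff, both evaluated at the lower resolution.
    """
    return current[0] - previous[0], current[1] - previous[1]


def r_gap(rwork, rfree):
    """Rgap = Rfree - Rwork, evaluated at the starting resolution."""
    return rfree - rwork


def shell_looks_beneficial(d_rwork, d_rfree, tol=1e-4):
    """Hypothetical check: Rfree drops, or Rfree is flat while Rwork rises."""
    rfree_drops = d_rfree < -tol
    rfree_flat_rwork_up = abs(d_rfree) <= tol and d_rwork > tol
    return rfree_drops or rfree_flat_rwork_up


# Illustrative, made-up values for one added shell.
d_rwork, d_rfree = r_differences((0.180, 0.210), (0.181, 0.208))
print(d_rwork, d_rfree, shell_looks_beneficial(d_rwork, d_rfree))
```

The tolerance defines what counts as a constant Rfree; its value here is arbitrary and would need to be judged against the spread observed in cross-validation.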
However, the overall R values are not the only parameters to be taken into account when deciding on the high-resolution cutoff. The analysis is further supplemented by plots of Rwork, Rfree, CCwork and CCfree (CCwork and CCfree are correlation coefficients between experimental and calculated intensities) of the refined structure models at defined resolution. Since a perfect model gives an R value of 0.42 against random data (i.e. pure noise) – assuming non-tNCS (translational non-crystallographic symmetry) data from a non-twinned crystal (Evans & Murshudov, 2013) – a higher R value in the (current) high-resolution shell indicates either the involvement of high-resolution data without information content (the data are even worse than noise), or poor quality of the model, or the presence of tNCS.
When unmerged data are available, values of CC* are added to the CCwork and CCfree plots. Comparison of CC values (correlation coefficients) with CC* serves for direct linking of the data and structure model quality (Diederichs & Karplus, 2013; Karplus & Diederichs, 2012). CCwork or CCfree greater than CC* in a high-resolution shell indicates undesirable overfitting of the structure model as the calculated intensities agree with the observed data better than the (usually unavailable) true data. Owing to the independence of CC* on a model, its comparison with CCwork is just as informative as comparison with CCfree. However, the usage of CCwork should be preferred since it is based on much more data.
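For reference, CC* is obtained from CC1/2 as CC* = [2CC1/2/(1 + CC1/2)]^(1/2) (Karplus & Diederichs, 2012). The sketch below evaluates this relation and the comparison with CCwork described above; the function names are hypothetical and the handling of negative CC1/2 values (rejected here) is an assumption of this example.

```python
import math


def cc_star(cc_half):
    """CC* estimated from CC1/2: sqrt(2 * CC1/2 / (1 + CC1/2))."""
    if not 0.0 <= cc_half <= 1.0:
        # Negative CC1/2 values occur in pure-noise shells; CC* is then
        # undefined, so this sketch rejects them explicitly.
        raise ValueError("CC1/2 must lie in [0, 1] for CC* to be defined")
    return math.sqrt(2.0 * cc_half / (1.0 + cc_half))


def overfitting_suspected(cc_model, cc_half):
    """True if CCwork (or CCfree) exceeds CC* in a given shell."""
    return cc_model > cc_star(cc_half)


# The two extreme CC1/2 values quoted in the Discussion of this paper.
for cc_half in (0.027, 0.524):
    print(cc_half, cc_star(cc_half))
```

Because CC* is always at least as large as CC1/2 in this range, even a weak shell can leave room for the model to correlate with the data without indicating overfitting.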
For additional information, PAIREF reports the optical resolution as calculated using SFCHECK for each resolution cutoff. When all previous procedures are finished and unmerged diffraction data are available, the merging statistics are listed in a table and shown in graphs. Finally, the progress of the refinement procedures is reported so that convergence can be checked.
2.3. Distribution and documentation
Full documentation of PAIREF is available online at https://pairef.fjfi.cvut.cz and the program is distributed at https://pypi.org/project/pairef/.
3. Examples
The functionality and versatility of PAIREF have been thoroughly tested on a number of cases. Here, we selected six structures and data sets to demonstrate the broad application potential of the tool: simulated data for lysozyme from Gallus gallus (SIM) (Holton et al., 2014), and measured data for thermolysin from Bacillus thermoproteolyticus (TL) (Winter et al., 2018), a cysteine-bound complex of cysteine dioxygenase from Rattus norvegicus (CDO) (Karplus & Diederichs, 2012), endothiapepsin from Cryphonectria parasitica in complex with fragment B53 (EP) (Huschmann et al., 2016), interferon gamma from Paralichthys olivaceus (POLI) (Zahradník et al., 2018) and bilirubin oxidase from Myrothecium verrucaria (BO) (Koval' et al., 2019). All the results are available from https://doi.org/10.5281/zenodo.3687267.
A comprehensive summary of the crystallographic data and of the refinement is given in Tables 1 and 2. To be consistent with the previous results, the free reflection flags from the original data were preserved except for TL, for which they were not accessible.
Table footnotes: ‡ Number of additional reflections that the paired refinement results suggest including in the refinement, relative to the starting resolution. The added resolution range, in Å, is given in {} brackets. § Range where CC1/2 is significantly different from 0 at the 1:1000 level.
Table footnote: ‡ For the BO data set, values for a resolution shell beyond the optimal cutoff are listed in angled brackets 〈〉.
3.1. Simulated data set of lysozyme
The ability to generate artificial X-ray diffraction patterns based on a well defined `true' structure offers the possibility of monitoring the progress of paired refinement, especially the convergence of the refined models towards the `true' structure.
We generated one hundred diffraction images using a modified structure of lysozyme (data set SIM). At first, all alternative conformations were removed from the structure with the PDB entry 1h87 (originally determined at 1.72 Å resolution) (Girard et al., 2002). The data collection was simulated using MLFSOM (Holton et al., 2014) with a crystal-to-detector distance of 150 mm. MLFSOM also simulated global radiation damage for a beam of 8.4 × 10^10 photons s^−1 and 100 µm diameter, exposure of 0.1 s and a crystal size of 77.8 µm. Afterwards, the diffraction data set was processed using DIALS/AIMLESS (Evans & Murshudov, 2013; Winter et al., 2018) or XDS/XSCALE (Kabsch, 2010) up to a resolution of 1.20 Å, although the CC1/2 values are no longer significantly different from zero (at the 1:1000 level) beyond 1.35 Å resolution.
The input model for paired refinement was generated from the structure used for the generation of the diffraction images by perturbation of atomic coordinates by an average of 0.25 Å; the ADPs were set to their mean value (15 Å²). In the final preparation step, several cycles of refinement at the starting resolution (1.72 Å) against the processed simulated data were performed. In the next step, we performed the paired refinement protocol using PAIREF.
Structure models refined against the simulated data set have considerably lower R values when compared with the other structures (based on real experimental data) mentioned later (Rfree = 0.071 for SIM versus Rfree = 0.195 for TL, both at 1.72 Å). This effect, caused by the simulated character of the data, was also observed in the original work by Holton et al. (2014). However, the trends of nearly all indicators of data quality are similar to those of the real cases [see Fig. 2(a)]. Based on the plot of stepwise differences in overall R values, we decided to estimate the high-resolution limit as 1.3 Å because the R values increase for resolution shells beyond that limit.
We monitored the root-mean-square deviation (RMSD) values (DeLano Scientific, 2017) calculated on all 1217 atoms of the simulated structure with respect to the original structure model [Fig. 2(c)]. A systematic decrease was observed for the atomic coordinates when reflections from an additional high-resolution shell were added to the refinement up to 1.3 Å resolution. This is in agreement with the high-resolution cutoff based only on the behaviour of the differences in overall R values. In general, the RMSD of ADP values calculated for all the atoms (see equation given in the supporting information) follows a similar but not identical trend. Moreover, it continues to decrease and converge to the `true' value even for the highest resolution shell, which was later omitted from the data based on the other data quality indicators. As a result of our calculations, we suggest here application of a high-resolution cutoff at 1.3 Å when using our combination of programs and following our protocol. Similar results were also obtained using XDS/XSCALE for data processing.
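The coordinate RMSD used to monitor convergence towards the `true' structure can be sketched as follows. This is a plain-Python illustration under the assumption that both models list the same atoms in the same order and share the same reference frame, so no superposition is applied; it is not the routine used to produce Fig. 2(c).

```python
import math


def coordinate_rmsd(model, reference):
    """RMSD (in Å) between two equally ordered lists of (x, y, z) tuples.

    Assumption of this sketch: atoms are matched by list position and the
    models are already in the same frame, so no fitting is performed.
    """
    if len(model) != len(reference) or not model:
        raise ValueError("both models need the same, non-zero number of atoms")
    squared = sum(
        (x1 - x2) ** 2 + (y1 - y2) ** 2 + (z1 - z2) ** 2
        for (x1, y1, z1), (x2, y2, z2) in zip(model, reference)
    )
    return math.sqrt(squared / len(model))


# Illustrative two-atom example: each atom is displaced by 0.5 Å.
reference = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
model = [(0.3, 0.4, 0.0), (1.0, 1.3, 1.4)]
print(coordinate_rmsd(model, reference))
```

The same function applied to one-dimensional tuples of ADP values would give an analogous deviation for the displacement parameters, although the exact ADP formula used in the paper is given in its supporting information.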
3.2. Thermolysin
Successful application of paired refinement was previously demonstrated on the crystal structure of thermolysin (TL) from B. thermoproteolyticus (Winter et al., 2018). In the original protocol, the structure was modified (perturbation of atomic positions) and refined at a defined high-resolution limit in the range from 1.80 to 1.50 Å. Model improvement was monitored on Rgap only, which decreased until 1.56 Å resolution. A further increase in the resolution did not cause a substantial change of Rgap.
To reproduce most of the original procedures by Winter et al., the diffraction data were processed with xia2 (Winter, 2010) using DIALS/AIMLESS software. The structure of thermolysin (PDB entry 3n21; Behnen et al., 2012) was used as a starting model. The atomic coordinates were perturbed and all ADPs were set to their average value of 22 Å² with phenix.pdbtools (Adams et al., 2010). A total of 30 cycles of refinement were performed with REFMAC5 at a resolution of 1.80 Å. After that, ligands (peptide in the active site, three molecules of DMSO) and solvent were built in Coot (Emsley et al., 2010), refined with REFMAC5 and finally used in PAIREF to analyse the high-resolution cutoff.
We performed two PAIREF runs that added stepwise high-resolution shells with a width of 0.10 and 0.01 Å. Rfree has a decreasing trend up to 1.50 Å for the first run [Fig. 2(d)], which suggests that the data should be cut at this resolution. Moreover, the plot of Rgap [Fig. 2(f)] from the second run further confirms a good agreement between the previously published results and our calculations.
3.3. Cysteine dioxygenase
The cysteine-bound complex of cysteine dioxygenase from R. norvegicus (CDO) (Simmons et al., 2008) was the first macromolecular crystal structure on which the paired refinement protocol was demonstrated (Karplus & Diederichs, 2012). The conservative criterion for Rmeas suggests setting the high-resolution diffraction limit to 1.80 Å and the criterion of 〈I/σ(I)〉 higher than 2 suggests setting the limit to 1.60 Å, but paired refinement proved that the data are useful up to 1.42 Å. All refinement was previously performed using phenix.refine (Afonine et al., 2012).
Here, we tried to reproduce the previous results in PAIREF, which uses REFMAC5 as a structure refinement program. We have reprocessed the original images with XDS. The input structure model was prepared according to the following protocol: the protein atomic positions of the unliganded CDO structure (PDB entry 2b5h; Simmons et al., 2006) were perturbed by an average of 0.25 Å with phenix.pdbtools; the ligand (cysteine persulfenate) was built manually with Coot. Subsequently, the model was refined with REFMAC5 at 2.00 Å resolution, and solvent was added automatically using ARP/wARP (Lamzin & Wilson, 1993), followed by a manual check of the ligand and solvent and refinement with REFMAC5. This model was later used as the input file for PAIREF to analyze the high-resolution shells with a width of 0.10 Å. Unlike the protocol published previously, solvent molecules were not automatically updated during paired refinement.
The differences of overall R values [Fig. 2(g)] indicate that the high-resolution diffraction limit may be set to 1.60 Å using our combination of software and free reflection set. However, the selection of free reflections may have an impact on the results and conclusions from paired refinement; therefore, we ran a second procedure of 20-fold cross-validation across all free reflection sets, as described in Section 2.1. The differences of overall Rfree averaged over the free sets are negative up to 1.50 Å resolution [Fig. 2(j)]. CC* remains higher than CCwork in the whole resolution range for all the refined models. Moreover, the trend of Rgap [Fig. 2(i)] shows a moderate decrease for higher resolution going up to 1.42 Å when shells with a width of 0.01 Å were analyzed in the third run of paired refinement using the original free flag 0. To conclude, our calculations indicate that the data improve the model up to 1.50 Å resolution. This suggestion originates from the complete cross-validation protocol, which should always be considered when deciding on the high-resolution cutoff.
3.4. Endothiapepsin in complex with fragment B53
In the cases reported above, the improvement of structure models using paired refinement was shown on statistical criteria. However, the increase in information gained from the data may also be shown by the interpretability of electron-density maps. Such enhancement was already reported for the crystal structure of the prokaryotic sodium channel pore (improvement from 4.0 to 3.5 Å resolution) and for the crystal structure of the YfbU protein from E. coli (improvement from 3.1 to 2.5 Å resolution) (Karplus & Diederichs, 2015). To demonstrate this effect using PAIREF, we reprocessed the diffraction data from the crystal structure of endothiapepsin (EP) from C. parasitica in complex with fragment B53 (PDB entry 4y4g; Huschmann et al., 2016) using XDS. The data set originates from a fragment screening project; fragment B53 has a partial occupancy.
The data were originally processed up to 1.44 Å resolution with an 〈I/σ(I)〉 value of 2 in the highest resolution shell (1.52–1.44 Å). Here, we tried to simulate the regular workflow of model building and structure refinement. We removed all solvent molecules including ligands from the deposited model. The atomic coordinates were perturbed as done previously and the ADPs were manually set to their mean value of 16 Å². Subsequently, 15 cycles of refinement using anisotropic ADPs were performed with REFMAC5. These procedures were later followed by PAIREF calculations up to a resolution of 1.05 Å. According to our results, the optimal high-resolution limit was set to 1.20 Å [Fig. 3(a)] since positive Rfree differences are observed for the higher resolution shells.
Inclusion of more intensities in the working data set considerably improved the quality of the omit map belonging to the partially occupied ligand [Fig. 3(c)]. In general, we expect that the greatest improvement in interpretability will occur for weak density features because the noise level of the map decreases due to improved phases resulting from a more accurate model. This will not significantly influence the observation of atoms with strong density. However, for a feature in the electron-density map that is close to the lower contour levels used in interpreting the map, having a bit less noise will have a higher impact on the reliability and interpretability of the electron-density map. In our case, this effect was observed in the stage of ligand and solvent building, which may be valuable especially in difficult cases and with low-occupied ligands.
3.5. Interferon gamma
All the above-mentioned cases are high-resolution crystal structures. The crystal structure of interferon gamma from P. olivaceus (POLI) was previously determined at a medium resolution of 2.3 Å (Zahradník et al., 2018). Moreover, the data exhibited severe anisotropy. Resolution limits were estimated in the range from 2.26 to 2.71 Å, according to the criterion of 〈I/σ(I)〉 being higher than 1.5 in the highest resolution shell (Evans & Murshudov, 2013). The data were reprocessed in XDS up to 1.9 Å resolution. The deposited structure (PDB entry 6f1e; Zahradník et al., 2018) was refined using all of the reflections in the final step. However, in our paired refinement we used the last model that was refined using work reflections only.
Several parameters were used to evaluate the high-resolution cutoff. Monitoring of Rfree differences suggests a high-resolution cutoff at 2.0 Å [see Fig. 3(d)]. The value of Rwork of the model refined at 1.9 Å calculated against the data in the highest resolution shell (2.0–1.9 Å) is high: 0.43 [Fig. 3(f)], i.e. it exceeds the R value of a perfect model refined against random data (see Section 2.2). We suggest omitting the highest resolution shell in further refinement and cutting the data at 2.0 Å resolution. Poor CC* values at high resolution are probably caused by the anisotropy of the diffraction data, which affects the correlation between reflections. These results show that the decision on diffraction data resolution should not be based only on a single value of a data quality indicator, but on a more comprehensive evaluation of the available data.
3.6. Bilirubin oxidase
The choice of the structure refinement program and the parameters of refinement are the most decisive tools in paired refinement. PAIREF supports broad modification of structure refinement protocols using a command file for REFMAC5, including modification of ligand libraries. To demonstrate this functionality, we have analyzed the crystal structure of bilirubin oxidase in complex with ferricyanide (BO) (PDB entry 6i3j). The structure was previously refined at 2.59 Å resolution with 〈I/σ(I)〉 equal to 2 in the highest resolution shell (Koval' et al., 2019) as shown in Fig. 3(i).
We have reprocessed the diffraction data up to a resolution of 2.3 Å with XDS. The last model originally refined using working reflections only was used as an input file for paired refinement. The library definitions for hexacyanoferrate, the weighting matrix and several external harmonic restraints were supplied to the protocol (see the supporting information). In this case, no improvement in resolution can be expected according to PAIREF. Although the values of CC* are higher than CCwork and CCfree in the whole resolution range [Fig. 3(h)], an increase in Rfree values indicates that the original high-resolution cutoff was set reasonably [Fig. 3(g)].
To further prove this, we ran the paired refinement protocol with 2.8 Å resolution as a starting resolution. At such low resolution, it was important to perform moderate atomic coordinate perturbation (mean shift 0.02 Å); the ADPs were set to their mean value of 35 Å². In this case, paired refinement suggested the data should be cut at 2.6 Å resolution, which was the original conservative cutoff (see the supporting information).
In addition, we ran the paired refinement protocol starting at 2.59 Å resolution without supplying the external harmonic restraints. An apparent improvement up to 2.5 Å resolution was observed in the data quality indicators. However, the lack of the important restraints led to unacceptable geometry of hexacyanoferrate molecules and of several amino acid residues (away from the active site) in the output files, so this could not be accepted as a positive result. Analysis of the geometry of the refined model is beyond the scope of the PAIREF program and is not implemented. Therefore, it remains the user's responsibility to perform such analysis. To that end, PAIREF provides direct links to input, output and log files from all calculation procedures.
3.7. Impact of the model quality
We performed a limited analysis of the impact of the starting model quality on results from paired refinement. We selected the EP and POLI data sets as examples of structures solved using molecular replacement and an experimental phasing method, respectively. Several models from different model building stages were used in the analysis.
3.7.1. Molecular replacement and the EP data set
We solved the structure using the molecular replacement method with Phaser (McCoy et al., 2007). The crystal structure of penicillopepsin (54% identity, 67% similarity; PDB entry 2wea; Ding et al., 1998) was used as a search model. Subsequently, the protein chain was built automatically by ARP/wARP (Langer et al., 2008) at the starting resolution (1.45 Å). Altogether, we analyzed four stages of the model building: (i) the model placed by molecular replacement (i.e. containing the penicillopepsin sequence), (ii) the protein chain built by ARP/wARP, (iii) the original model of the final structure (PDB entry 4y4g) without solvent and (iv) the final complete deposited model [Figs. 4(a)–4(d)]. We used an identical setup for all the paired refinement protocols. Initially, the coordinates were perturbed by an average of 0.25 Å and the ADPs were set to their mean value, followed by 250 refinement cycles at the starting resolution (required for convergence). Then, high-resolution shells with a width of 0.05 Å were added stepwise (see the supporting information).
Surprisingly, utilization of the data in the whole resolution range (up to 1.10 Å) is suggested when using a distant protein model correctly placed in the unit cell. In contrast to this, improvement only up to 1.30 Å is observed using the model after complete protein rebuilding with ARP/wARP. Use of a protein model with no solvent molecules suggests the application of a high-resolution cutoff at 1.25 Å and use of the most complete model suggests a cutoff at 1.20 Å.
3.7.2. Experimental phasing and the POLI data set
The crystal structure of interferon gamma from P. olivaceus was solved using SAD phasing. The following stages of model building were analysed: a poly-Ala model from SHELXE (Sheldrick, 2002), a complete protein model without solvent from PHENIX AutoBuild (Terwilliger et al., 2008) [Figs. 4(e) and 4(f)] and the model prior to the final refinement [Fig. 3(d)] at the starting resolution (2.3 Å). Here we used optimized parameters of the paired refinement protocol for each specific model (see the supporting information).
The use of incomplete models in paired refinement suggested the application of a high-resolution cutoff of 2.2 Å, while the use of the most complete model suggested a cutoff of 2.0 Å. Given both examples mentioned above, it can be stated that the model quality and completeness may play a significant role in the results from paired refinement.
4. Limitations and further development
Amongst the hundreds of trials we performed, we did not register any failure of PAIREF itself. However, in a few cases, the external programs may fail to report an appropriate value, which may cause the crash of the PAIREF run. These cases were observed mostly at unreasonable resolution, e.g. the third or fourth resolution shell that should have already been omitted, or during analysis of very thin shells (e.g. 0.01 Å).
Results of paired refinement are strongly influenced by the structure refinement protocol (and in some cases also by the specific REFMAC5 version). In most of the cases mentioned above, a possible improvement in model accuracy owing to the use of higher-resolution data was detected using PAIREF. However, no improvement from the conservative cutoff was observed in the case of bilirubin oxidase.
The main focus of our further development will be the implementation of structure refinement using phenix.refine. Most of the refinement procedures cannot be parallelized. Nevertheless, the parallelization of the complete cross-validation protocol is planned to significantly reduce computational time. Moreover, the inclusion of other monitoring statistics, e.g. Rcomplete (Luebben & Gruene, 2015), in the final report is under development.
5. Discussion
In macromolecular crystallography, the maximum amount of valuable data should be used to obtain the best possible structural models. Hence, evaluation of data significance should be based on novel approaches. This involves the implementation of correlation coefficients and simultaneous monitoring of trends of several statistics that are directly linked to the quality of the refined model. Paired refinement is currently generally accepted as the optimal protocol for the determination of the high-resolution cutoff. The PAIREF program is a command-line tool that performs such an analysis and creates a compact report for users to make a self-contained decision on the data limit.
In one of the examples documented here, we first analyzed the progress of the paired refinement procedure as well as the PAIREF functionality on data that have been artificially generated from a known structure. This structure later served as a target to monitor the convergence of the refined models. Continuous improvement in agreement between the original structure and models from paired refinement was observed in a range where our criteria suggested acceptance of further data. Here, the RMSD calculations showed that use of the high-resolution cutoff suggested by paired refinement produces models closest to the truth. The gap between CCwork and CC* visible for all projects except SIM corresponds to the R-value gap discussed by Holton et al. (2014), and is due to deficiencies in modelling the experiment.
We also tested the program on five other real cases, some of them previously used in paired refinement. In four cases, we showed that the model could be further improved by the use of data beyond conservative cutoffs. Our program is able to successfully reproduce two particular paired refinement protocols that were published previously [TL in the work by Winter et al. (2018) and CDO in the work by Karplus & Diederichs (2012)] and the results obtained are in good agreement with the original ones. Slight differences could be caused by the use of a newer version of REFMAC5 (in the case of TL), or by the utilization of other software and the absence of an automatic solvent update during paired refinement (in the case of CDO).
In the case of bilirubin oxidase, an agreement in the high-resolution estimation between the conservative and paired refinement approach was observed. In all reported cases, the values of 〈I/σ(I)〉 and CC1/2 are in the ranges from 0.1 to 1.7 and from 0.027 to 0.524, respectively, all in the highest accepted resolution shell. Therefore, it is clear that a resolution cutoff based purely on certain values of these statistics does not correspond to the information content in the last or next additional resolution shell, as shown in previous works (Karplus & Diederichs, 2012, 2015; Diederichs & Karplus, 2013; Evans & Murshudov, 2013; Winter et al., 2018).
The addition of high-resolution reflections suggested by the paired refinement results influences the amount of experimental data used in structure refinement as well as the overall agreement of the model to the data. In addition, it produces cleaner and more detailed maps which enable further manual improvement and removal of model errors by refinement. In the case of the data set from fragment screening (EP), we demonstrated that the involvement of valid data from higher resolution shells may have a positive impact on the quality of the electron-density map. Such an effect is clearly useful for low-occupancy ligands, partially disordered regions, alternative positions or low-resolution data.
We tested the influence of model quality on the results from paired refinement. We randomly chose a distant model for molecular replacement of the structure of endothiapepsin and simulated the procedure of structure building and refinement. We also used three models from various stages of refinement of interferon gamma from Paralichthys olivaceus. In these two cases, we observed that the use of a poor starting model suggested a lower high-resolution cutoff than the use of the most complete models. This notwithstanding, the use of a (partially) incorrect model may also result in a misleading suggestion, e.g. inclusion of the whole resolution range. Therefore, the input structure model should be selected carefully; paired refinement is particularly sensible in the final stage of structure refinement.
PAIREF worked well for the examples described using this general protocol: (i) processing of diffraction data at (almost) the full resolution; (ii) provisional resolution cutoff according to a conservative criterion, structure solution, model building and refinement; (iii) paired refinement with sufficient model quality at a later stage of model refinement.
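Step (iii) of this protocol can be pictured as a shell-by-shell comparison. The following Python sketch is only an illustration under stated assumptions and is not the PAIREF implementation: it assumes that, for each added shell, Rfree of the models refined without and with that shell has already been calculated on the same reflections (those within the previous limit), and it accepts shells as long as Rfree does not increase. The function name `suggest_cutoff`, the data layout, the tolerance `tol` and all numbers are hypothetical.

```python
from typing import List, Tuple

# One entry per added shell, ordered from low to high resolution:
# (new high-resolution limit in Angstrom,
#  R-free of the model refined WITHOUT the new shell,
#  R-free of the model refined WITH the new shell),
# both R-free values evaluated at the previous (lower) resolution limit.
Step = Tuple[float, float, float]


def suggest_cutoff(start: float, steps: List[Step], tol: float = 0.0) -> float:
    """Return the last limit for which adding data did not raise R-free."""
    cutoff = start
    for limit, rfree_without, rfree_with in steps:
        if rfree_with - rfree_without > tol:
            break  # the added shell worsened the model; stop here
        cutoff = limit
    return cutoff


# Hypothetical values, for illustration only.
steps = [
    (2.2, 0.2450, 0.2441),
    (2.1, 0.2478, 0.2470),
    (2.0, 0.2503, 0.2503),
    (1.9, 0.2531, 0.2547),
]
print(suggest_cutoff(2.3, steps))  # 2.0
```

In this made-up series the 1.9 Å shell raises Rfree, so the sketch suggests 2.0 Å. A single statistic is used here only for brevity; as discussed above, the trends of several statistics should be monitored together before a decision is made.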
With the introduction of paired refinement into X-ray crystallography, the high-resolution diffraction limit has gained a new meaning, as the only criterion for the data cutoff is now the `additional value' of the data in model refinement. Following the current trends in diffraction data evaluation, resolution cannot be directly related to a specific value of the conventional indicators of diffraction data quality.
Reflections that were added during the paired refinement protocol generally represent data with the lowest information content. Since they come from the highest resolution shells, their 〈I/σ(I)〉 is lower, Rmeas higher and CC1/2 lower. Nonetheless, they may represent a significant portion of the data. For most of the cases reported above, the reflections added through paired refinement account for more than 40% of all data. This of course is highly dependent on the conservative criteria that were used previously, before the paired refinement protocol was applied. Moreover, paired refinement has shown its importance for the improvement of structure models or even interpretability of electron-density maps.
Supporting information
Link https://doi.org/10.5281/zenodo.3687267
Paired refinement under control of PAIREF - examples
Supporting data. DOI: https://doi.org/10.1107/S2052252520005916/mf5044sup1.pdf
Acknowledgements
We thank Andrew Karplus for comments on the manuscript, James Holton for the discussion regarding the simulated data set SIM, Jan Stránský for development consultation, Jan Wollenhaupt and Manfred S. Weiss for providing the EP data set, and Tomáš Koval' and Leona Švecová for providing the BO data set.
Funding information
This work was supported by the Ministry of Education, Youth and Sports CR – projects CAAS (grant No. CZ.02.1.01/0.0/0.0/16_019/0000778 to the Faculty of Nuclear Sciences and Physical Engineering, Czech Technical University in Prague); ELIBIO (grant No. CZ.02.1.01/0.0/0.0/15_003/0000447 to the Institute of Biotechnology AS CR) and BIOCEV (grant No. CZ.1.05/1.1.00/02.0109 to the Institute of Biotechnology AS CR), from the ERDF fund; the Czech Science Foundation (grant No. 18-10687S to the Institute of Biotechnology AS CR); the Czech Academy of Sciences (grant No. 86652036); and by the Grant Agency of the Czech Technical University in Prague (grant No. SGS19/189/OHK4/3T/14).
References
Adams, P. D., Afonine, P. V., Bunkóczi, G., Chen, V. B., Davis, I. W., Echols, N., Headd, J. J., Hung, L.-W., Kapral, G. J., Grosse-Kunstleve, R. W., McCoy, A. J., Moriarty, N. W., Oeffner, R., Read, R. J., Richardson, D. C., Richardson, J. S., Terwilliger, T. C. & Zwart, P. H. (2010). Acta Cryst. D66, 213–221.
Afonine, P. V., Grosse-Kunstleve, R. W., Echols, N., Headd, J. J., Moriarty, N. W., Mustyakimov, M., Terwilliger, T. C., Urzhumtsev, A., Zwart, P. H. & Adams, P. D. (2012). Acta Cryst. D68, 352–367.
Behnen, J., Köster, H., Neudert, G., Craan, T., Heine, A. & Klebe, G. (2012). ChemMedChem. 7, 248–261.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235–242.
Brünger, A. T. (1993). Acta Cryst. D49, 24–36.
DeLano Scientific (2017). The PyMOL Molecular Graphics System, version 2.0. Schrödinger, LLC.
Diederichs, K. & Karplus, P. A. (2013). Acta Cryst. D69, 1215–1222.
Ding, J., Fraser, M. E., Meyer, J. H., Bartlett, P. A. & James, M. N. G. (1998). J. Am. Chem. Soc. 120, 4610–4621.
Drenth, J. & Jeroen, M. (2010). Principles of Protein X-ray Crystallography, pp. 248–278. New York: Springer.
Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. (2010). Acta Cryst. D66, 486–501.
Evans, P. R. (2011). Acta Cryst. D67, 282–292.
Evans, P. R. & Murshudov, G. N. (2013). Acta Cryst. D69, 1204–1214.
Girard, E., Chantalat, L., Vicat, J. & Kahn, R. (2002). Acta Cryst. D58, 1–9.
Grosse-Kunstleve, R. W., Sauter, N. K., Moriarty, N. W. & Adams, P. D. (2002). J. Appl. Cryst. 35, 126–136.
Holton, J. M., Classen, S., Frankel, K. A. & Tainer, J. A. (2014). FEBS J. 281, 4046–4060.
Hunter, J. D. (2007). Comput. Sci. Eng. 9, 90–95.
Huschmann, F. U., Linnik, J., Sparta, K., Ühlein, M., Wang, X., Metz, A., Schiebel, J., Heine, A., Klebe, G., Weiss, M. S. & Mueller, U. (2016). Acta Cryst. F72, 346–355.
Jiang, J.-S. & Brünger, A. T. (1994). J. Mol. Biol. 243, 100–115.
Joosten, R. P., Long, F., Murshudov, G. N. & Perrakis, A. (2014). IUCrJ, 1, 213–220.
Kabsch, W. (2010). Acta Cryst. D66, 125–132.
Karplus, P. A. & Diederichs, K. (2012). Science, 336, 1030–1033.
Karplus, P. A. & Diederichs, K. (2015). Curr. Opin. Struct. Biol. 34, 60–68.
Kleywegt, G. J. & Brünger, A. T. (1996). Structure, 4, 897–904.
Koval', T., Švecová, L., Østergaard, L. H., Skalova, T., Dušková, J., Hašek, J., Kolenko, P., Fejfarová, K., Stránský, J., Trundová, M. & Dohnálek, J. (2019). Sci. Rep. 9, 13700.
Lamzin, V. S. & Wilson, K. S. (1993). Acta Cryst. D49, 129–147.
Langer, G., Cohen, S. X., Lamzin, V. S. & Perrakis, A. (2008). Nat. Protoc. 3, 1171–1179.
Luebben, J. & Gruene, T. (2015). Proc. Natl Acad. Sci. 112, 8999–9003.
McCoy, A. J., Grosse-Kunstleve, R. W., Adams, P. D., Winn, M. D., Storoni, L. C. & Read, R. J. (2007). J. Appl. Cryst. 40, 658–674.
McNicholas, S., Potterton, E., Wilson, K. S. & Noble, M. E. M. (2011). Acta Cryst. D67, 386–394.
Murshudov, G. N., Skubák, P., Lebedev, A. A., Pannu, N. S., Steiner, R. A., Nicholls, R. A., Winn, M. D., Long, F. & Vagin, A. A. (2011). Acta Cryst. D67, 355–367.
Read, R. J., Oeffner, R. D. & McCoy, A. J. (2020). Acta Cryst. D76, 19–27.
Rossum, G. van (1995). Python Tutorial. Amsterdam: Centrum voor Wiskunde en Informatica.
Sheldrick, G. M. (2002). Z. Kristallogr. 217, 644–650.
Simmons, C. R., Krishnamoorthy, K., Granett, S. L., Schuller, D. J., Dominy, J. E., Begley, T. P., Stipanuk, M. H. & Karplus, P. A. (2008). Biochemistry, 47, 11390–11392.
Simmons, C. R., Liu, Q., Huang, Q., Hao, Q., Begley, T. P., Karplus, P. A. & Stipanuk, M. H. (2006). J. Biol. Chem. 281, 18723–18733.
Terwilliger, T. C., Grosse-Kunstleve, R. W., Afonine, P. V., Moriarty, N. W., Zwart, P. H., Hung, L.-W., Read, R. J. & Adams, P. D. (2008). Acta Cryst. D64, 61–69.
Tickle, I. (2011). Number of cycles in REFMAC. https://www.mail-archive.com/ccp4bb@jiscmail.ac.uk/msg22423.html.
Vaguine, A. A., Richelle, J. & Wodak, S. J. (1999). Acta Cryst. D55, 191–205.
Winn, M. D., Ballard, C. C., Cowtan, K. D., Dodson, E. J., Emsley, P., Evans, P. R., Keegan, R. M., Krissinel, E. B., Leslie, A. G. W., McCoy, A., McNicholas, S. J., Murshudov, G. N., Pannu, N. S., Potterton, E. A., Powell, H. R., Read, R. J., Vagin, A. & Wilson, K. S. (2011). Acta Cryst. D67, 235–242.
Winter, G. (2010). J. Appl. Cryst. 43, 186–190.
Winter, G., Waterman, D. G., Parkhurst, J. M., Brewster, A. S., Gildea, R. J., Gerstel, M., Fuentes-Montero, L., Vollmar, M., Michels-Clark, T., Young, I. D., Sauter, N. K. & Evans, G. (2018). Acta Cryst. D74, 85–97.
Zahradník, J., Kolářová, L., Pařízková, H., Kolenko, P. & Schneider, B. (2018). Fish Shellfish Immunol. 79, 140–152.
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.