PDB2INS: bridging the gap between small-molecule and macromolecular refinement
The open-source Python program PDB2INS is designed to prepare a .ins file for refinement with SHELXL [Sheldrick (2015). Acta Cryst. C71, 3–8], taking atom coordinates and other information from a Protein Data Bank (PDB)-format file. If PDB2INS is provided with a four-character PDB code, both the PDB file and the accompanying mmCIF-format reflection data file (if available) are accessed via the internet from the PDB public archive [Read et al. (2011). Structure, 19, 1395–1412] or optionally from the PDB_REDO server [Joosten, Long, Murshudov & Perrakis (2014). IUCrJ, 1, 213–220]. The SHELX-format .ins (refinement instructions and atomic coordinates) and .hkl (reflection data) files can then be generated without further user intervention, appropriate restraints etc. being added automatically. PDB2INS was tested on the 23 974 X-ray structures deposited in the PDB between 2008 and 2018 that included reflection data to 1.7 Å or better resolution in a recognizable format. After creating the two input files for SHELXL without user intervention, ten cycles of conjugate-gradient least-squares refinement were performed. For 96% of these structures PDB2INS and SHELXL completed successfully without error messages.
Historically, computer methods for crystal structure refinement developed relatively independently for inorganic, organometallic and organic structures on the one hand and biological macromolecules on the other. This resulted in many incompatibilities involving file formats and nomenclature. SHELXL (Sheldrick, 2015; https://shelx.uni-goettingen.de/), probably the most widely used program for small-molecule refinement, has some features – e.g. the estimation of least-squares standard deviations for the refined parameters, the ease of handling complicated disorders and non-merohedral twins, and the powerful concept of 'free variables' – that might be useful for macromolecular refinement when high-resolution data are available. For an example of a macromolecular refinement in which the estimation of least-squares standard deviations played a decisive role, see Köpfer et al. (2014). SHELX was written for small-molecule refinement in the early 1970s. Major extensions in the 1990s (e.g. the introduction of residues and the removal of restrictions on the number of atoms) first made it possible to use it for macromolecular refinement (Sheldrick, 1993; Sheldrick & Schneider, 1997), but the extensive reformatting required was still an impediment. The adoption of CIF format might have simplified this problem, but unfortunately mmCIF and small-molecule CIF are hardly compatible; for example, even the unit-cell dimensions have different names.
The original implementation of the least-squares refinement in SHELXL was based closely on a scheme proposed by Cruickshank (1969). SHELXL uses a conventional structure factor summation rather than a fast Fourier transform and implements the calculation of Rcomplete (Luebben & Gruene, 2015) as well as Rfree (Brünger, 1992). It offers a choice of full- or blocked-matrix refinement or conjugate-gradient solution of the least-squares normal equations (Konnert & Hendrickson, 1980). It may be used for crystal structure refinement against single-wavelength or Laue X-ray, neutron or electron diffraction data (Gruene et al., 2014; Clabbers et al., 2019) and can also handle merohedral and non-merohedral twins.
The Python program PDB2INS is designed to automate setting up a SHELXL refinement starting from a macromolecular structure in PDB format (https://www.rcsb.org/, https://pdbj.org/, https://www.ebi.ac.uk/pdbe/ and https://pdb-redo.eu/). It replaces the Fortran program SHELXPRO (Sheldrick & Schneider, 1997) that was originally distributed with SHELX for this purpose.
If PDB2INS is provided with a four-character PDB code, both the PDB file and the accompanying mmCIF-format reflection data file (if available) are accessed via the internet from the PDB public archive (Read et al., 2011) or optionally from the PDB_REDO server (Joosten et al., 2014). The SHELX-format .ins (refinement instructions and atomic coordinates) and .hkl (reflection data) files can then be generated without further user intervention, appropriate restraints etc. being added automatically.
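As a rough illustration of this first step, the following sketch checks a four-character PDB code and builds download addresses for the coordinate and reflection files. The URL pattern and the function names are assumptions made for this example only; they are not taken from the PDB2INS source, which may locate the files differently.

```python
import re

# Assumed download location (hypothetical choice for this sketch);
# PDB2INS itself may use a different server or URL layout.
RCSB_DOWNLOAD = "https://files.rcsb.org/download"


def is_pdb_code(code):
    """Return True for a four-character code: one digit, then three alphanumerics."""
    return re.fullmatch(r"[0-9][A-Za-z0-9]{3}", code) is not None


def download_urls(code):
    """Build the assumed URLs for the PDB file and the mmCIF reflection file."""
    if not is_pdb_code(code):
        raise ValueError("not a four-character PDB code: %r" % code)
    code = code.lower()
    return {
        "coordinates": "%s/%s.pdb" % (RCSB_DOWNLOAD, code),
        "reflections": "%s/%s-sf.cif" % (RCSB_DOWNLOAD, code),
    }
```

The reflection file is not available for every entry, so a real client would have to treat a failed download of the second address as a normal outcome rather than an error.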
An atom in a PDB file (Dutta et al., 2009) must have a unique combination of chain identifier (one character), residue sequence number (up to four digits), alternative location character and atom name (up to four characters), and in addition it has a residue name (up to three characters) and may have a one-character insertion code. Residue names and numbers considerably simplify the application of restraints in SHELXL refinements, also for small molecules, because the same names can be used for atoms in similar residues. For example, the instruction
FLAT_TOL C1 > C7
could be used to restrain the carbon atoms in each toluene residue to lie in a plane. The original small-molecule approach of using different names for each atom would involve much more typing and is less intuitive and more error prone. Residue names and numbers are set by RESI instructions such as
RESI TOL 21
which would be followed by a residue consisting of one toluene molecule. Such RESI instructions remain in force until the next RESI instruction is read. Several residues with different residue numbers may have the same residue name but not vice versa. Residue numbers for SHELXL must be between −999 and 9999 inclusive. Chain identifiers were needed for compatibility with the PDB but were first introduced into SHELXL in 2016. This required a major reorganization of the code and caused incompatibilities with several legacy programs. Chain identifiers are required for very large structures but can be useful, even for small molecules, when there are several similar molecules in the asymmetric unit. The residue numbers in the RESI instructions are then extended to include the chain IDs, e.g.
RESI TOL A:21
for residue 21 in chain A. For SHELX the chain identifiers may only be upper- or lower-case letters, digits, or the blank character, so there are only 63 different possible chain identifiers. This is enough except for the very largest structures in the PDB. Note that chain identifiers are an exception to the usual SHELX rule that upper- and lower-case letters are treated as equivalent.
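The naming rules described above can be summarized in a few lines of Python. This is a minimal sketch with a hypothetical helper name (resi_instruction), not the routine used in PDB2INS; it only encodes the limits stated in the text (residue numbers from −999 to 9999 and 63 case-sensitive chain identifiers).

```python
import string

# Upper- and lower-case letters, digits and the blank: 26 + 26 + 10 + 1 = 63.
CHAIN_CHARS = string.ascii_letters + string.digits + " "


def resi_instruction(name, number, chain=" "):
    """Format a RESI instruction, adding the chain ID when it is not blank."""
    if len(chain) != 1 or chain not in CHAIN_CHARS:
        raise ValueError("chain identifier must be one letter, digit or blank")
    if not -999 <= number <= 9999:
        raise ValueError("residue number must lie between -999 and 9999")
    if chain == " ":
        return "RESI %s %d" % (name, number)
    return "RESI %s %s:%d" % (name, chain, number)
```

Because the chain identifier is compared character for character, chains 'A' and 'a' remain distinct, in line with the exception to the usual case-insensitivity noted above.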
SHELXL expects the space-group symmetry to be defined by the coordinates of the general position rather than the space-group name. This permits the use of non-standard settings. PDB2INS uses a dictionary approach in which the symmetry generators (Fischer & Koch, 2005) are stored for each non-centrosymmetric space-group symbol and used to generate the SHELX-format SYMM instructions by iterative multiplication. The open-source Python module SPAGSYDATA used by PDB2INS to do this is available from https://github.com/av-luebben/spagsydata.
For example, the space group R3 is defined by
SYMM -Y, X-Y, Z
SYMM -X+Y, -X, Z
on hexagonal axes or
SYMM Z, X, Y
SYMM Y, Z, X
on primitive rhombohedral axes.
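The idea of generating the operators by iterative multiplication can be sketched as follows. This is an independent illustration, not the SPAGSYDATA code: the function names are hypothetical, each operator is stored as a rotation matrix plus a translation, and lattice centring is ignored because it is not expressed through SYMM instructions.

```python
from fractions import Fraction
from itertools import product

IDENTITY = (((1, 0, 0), (0, 1, 0), (0, 0, 1)), (Fraction(0),) * 3)


def multiply(a, b):
    """Compose two operators (rotation, translation): b is applied first, then a."""
    (ra, ta), (rb, tb) = a, b
    rot = tuple(
        tuple(sum(ra[i][k] * rb[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )
    trans = tuple(
        (sum(ra[i][k] * tb[k] for k in range(3)) + ta[i]) % 1 for i in range(3)
    )
    return rot, trans


def close_group(generators):
    """Multiply known operators by the generators until no new operator appears."""
    ops = [IDENTITY]
    grew = True
    while grew:
        grew = False
        for a, b in product(list(ops), list(generators)):
            c = multiply(a, b)
            if c not in ops:
                ops.append(c)
                grew = True
    return ops


def to_symm(op):
    """Write one operator in the 'X, Y, Z' notation used on SYMM lines."""
    rot, trans = op
    parts = []
    for row, t in zip(rot, trans):
        text = "" if t == 0 else str(t)
        for coeff, axis in zip(row, "XYZ"):
            if coeff:
                sign = "-" if coeff < 0 else "+"
                mag = "" if abs(coeff) == 1 else str(abs(coeff))
                text += sign + mag + axis
        parts.append(text.lstrip("+"))
    return ", ".join(parts)


# Threefold axis of R3 on hexagonal axes: (x, y, z) -> (-y, x-y, z).
THREEFOLD_HEX = (((0, -1, 0), (1, -1, 0), (0, 0, 1)), (Fraction(0),) * 3)
symm_lines = ["SYMM " + to_symm(op) for op in close_group([THREEFOLD_HEX])[1:]]
```

With the single threefold generator, the closure contains the identity and the two operators listed above for hexagonal axes; the identity is skipped when the SYMM lines are written.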
SHELXL refines fractional coordinates rather than the Cartesian coordinates used in the PDB, so PDB2INS applies the appropriate transformations to the atomic coordinates and atomic displacement parameters.
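The coordinate part of this transformation might look like the sketch below. It assumes the standard PDB orthogonalization convention (a along x, b in the xy plane); a file with non-standard SCALE records would need the matrix given there instead. The function names are hypothetical and the transformation of the displacement parameters is not shown.

```python
import math


def orthogonalization_matrix(a, b, c, alpha, beta, gamma):
    """Upper-triangular matrix taking fractional to Cartesian coordinates (angles in degrees)."""
    cos_al = math.cos(math.radians(alpha))
    cos_be = math.cos(math.radians(beta))
    cos_ga = math.cos(math.radians(gamma))
    sin_ga = math.sin(math.radians(gamma))
    # Unit-cell volume divided by a*b*c.
    v = math.sqrt(1 - cos_al**2 - cos_be**2 - cos_ga**2
                  + 2 * cos_al * cos_be * cos_ga)
    return [
        [a, b * cos_ga, c * cos_be],
        [0.0, b * sin_ga, c * (cos_al - cos_be * cos_ga) / sin_ga],
        [0.0, 0.0, c * v / sin_ga],
    ]


def frac_to_cart(frac, cell):
    """Convert fractional coordinates to Cartesian coordinates in angstroms."""
    m = orthogonalization_matrix(*cell)
    return tuple(sum(m[i][j] * frac[j] for j in range(3)) for i in range(3))


def cart_to_frac(xyz, cell):
    """Convert Cartesian to fractional coordinates by back-substitution."""
    m = orthogonalization_matrix(*cell)
    z = xyz[2] / m[2][2]
    y = (xyz[1] - m[1][2] * z) / m[1][1]
    x = (xyz[0] - m[0][1] * y - m[0][2] * z) / m[0][0]
    return (x, y, z)
```

For an orthogonal cell the conversion reduces to dividing each Cartesian coordinate by the corresponding cell edge, which is a convenient check of the implementation.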
PDB2INS includes HFIX instructions in the .ins file for generating hydrogen atoms in the form of comments, each prefaced with REM, e.g. for a valine residue
REM HFIX_VAL 43 N
REM HFIX_VAL 13 CA CB
REM HFIX_VAL 33 CG1 CG2
Later in the refinement the user can delete 'REM ' to activate the hydrogen-atom generation. Such riding hydrogen atoms do not change the number of parameters refined. So that missing atoms in a side chain do not cause an error when SHELXL later tries to generate hydrogen atoms, PDB2INS adds 'HFIX 0' instructions that can be edited when the structure becomes more complete.
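Activating the commented-out instructions by hand is easy for a few residues, but for a whole protein a short script may be more convenient. The sketch below uses a hypothetical helper (activate_hfix) that strips the REM prefix only from lines that begin with REM HFIX and leaves all other comments untouched; it is not part of PDB2INS.

```python
def activate_hfix(ins_text):
    """Remove the 'REM ' prefix from commented-out HFIX instructions."""
    prefix = "REM "
    lines = []
    for line in ins_text.splitlines():
        if line.startswith(prefix + "HFIX"):
            line = line[len(prefix):]
        lines.append(line)
    return "\n".join(lines)
```

Applied to the valine example above, the three lines become HFIX_VAL 43 N, HFIX_VAL 13 CA CB and HFIX_VAL 33 CG1 CG2.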
PDB2INS automatically adds the restraints on 1,2- and 1,3-distances corresponding to the restraints on bond distances and angles in amino acids given by Engh & Huber (1991). In addition PDB2INS uses a library of restraints for other common residues that were generated using the Grade server (https://grade.globalphasing.org/). Alternatively, suitable geometrical restraints can be generated by PROSMART (Nicholls et al., 2012). PDB2INS uses interatomic distances to detect disulfide bridges and C- and N-terminal residues and adds appropriate geometrical restraints and HFIX instructions. SHELXL uses planarity restraints rather than torsion angle restraints to ensure peptide planarity. A side effect of this is that it is possible for a trans-peptide to refine with SHELXL to cis or vice versa if the data strongly indicate that this is required (Stenkamp, 2005).
Anisotropic refinement requires six displacement parameters per atom instead of the single one needed for isotropic refinement. This is too many for most macromolecules. The rigid-bond restraint DELU may be used to make the motions of bonded atoms along the bond joining them more equal, and the SIMU restraint, which makes the Uij anisotropic displacement parameters of two atoms more equal, is particularly useful for disordered models in which atoms overlap. However, the more recent RIGU extended rigid-bond model (Thorn et al., 2012) leads to a substantial reduction in the number of effective parameters. The RIGU model simply assumes that the relative motion of two bonded atoms is at right angles to the bond joining them, reducing the effective number of parameters per atom to three. Applying RIGU to 1,3-distances leads to a further reduction. An additional constraint (XNPD) imposes a minimum value for the motion of an atom in any direction, preventing displacement ellipsoids from becoming non-positive definite. Taken together, these two options, which can be applied globally with the instructions RIGU and XNPD 0.01, enable structures to be refined anisotropically at appreciably lower resolution than previously possible. PDB2INS always writes these instructions to the .ins file, but RIGU only takes effect when the atoms are made anisotropic with the instruction ANIS.
SHELXL stores the scattering factors of the first 98 elements in the periodic table, recognizes the wavelengths of the more common in-house sources (Ga, Cu, Mo, Ag and In), and sets the absorption and dispersion coefficients for them automatically. PDB2INS sets up DISP instructions giving the values of f′, f′′ and μ generated automatically from Kissel tables (Roy et al., 1993) using the given wavelength. For neutron diffraction, the user must insert a NEUT instruction. For naturally occurring elements, the average isotopic distribution is assumed for the scattering lengths; for synthetic isotopes, the most common isotope is assumed. For other isotopes, the user must insert the appropriate SFAC instruction. Starting with SHELXL2019/1, if the wavelength is shorter than 0.1 Å the reflection data are assumed to be electron diffraction data and electron scattering factors generated using the Mott–Bethe formula are used. In all cases the scattering factors and dispersion corrections may be set by hand by editing the .ins file to include appropriate SFAC instructions.
The SHELX reflection data file (.hkl) was originally designed for 80-column punched cards. There is one reflection per line in this fixed-format text file, with four columns for each of the indices h, k and l, eight columns each for R and S, and four columns for B.
The reflection indices h, k, l are right-justified integers. For the intensities R and their standard deviations S, the position of the decimal point determines how these floating-point numbers are read. If the decimal point is missing these numbers are read as right-justified integers and then converted to floating point. If the .ins file ends with HKLF 4, R and S are intensities and their standard deviations; if it ends in HKLF 3, they are F and σ(F). The four characters B were historically the batch number (e.g. for Weissenberg films). Now this is normally −1 for a free-R reflection and +1 or absent for the reflections used for refinement. The program mtz2hkl (Grune, 2008) may be used to convert a CCP4 .mtz reflection data file to SHELX .hkl format.
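A fixed-format reader for such lines can be sketched in a few lines of Python. The column widths used here (four characters for each index, eight each for R and S, four for B) are the usual SHELX layout and are stated as an assumption of this example; the function names are hypothetical.

```python
def format_hkl_line(h, k, l, value, sigma, batch=1):
    """Write one reflection using the assumed widths 4, 4, 4, 8, 8 and 4."""
    return "%4d%4d%4d%8.2f%8.2f%4d" % (h, k, l, value, sigma, batch)


def parse_hkl_line(line):
    """Read one reflection; a missing batch field counts as +1 (working set)."""
    line = line.rstrip("\n").ljust(32)
    h, k, l = (int(line[i:i + 4]) for i in (0, 4, 8))
    # float() accepts fields with or without a decimal point, so integer
    # fields are converted to floating point as described in the text.
    value = float(line[12:20])
    sigma = float(line[20:28])
    flag = line[28:32].strip()
    batch = int(flag) if flag else 1
    return {"hkl": (h, k, l), "value": value, "sigma": sigma,
            "free": batch == -1}
```

Whether the two numerical fields are intensities or structure-factor amplitudes cannot be decided from the line itself; it depends on the HKLF instruction at the end of the .ins file.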
PDB2INS is written in object-oriented Python 2.7. PyInstaller (https://www.pyinstaller.org/) was used to compile it into stand-alone executables. PDB2INS is open source and may be downloaded as part of CCP4 (Winn et al., 2011; https://www.ccp4.ac.uk/), from the SHELX server (https://shelx.uni-goettingen.de/) or from GitHub (https://github.com/av-luebben/pdb2ins), where the code was published. In addition to the command-line version, a free graphical user interface (GUI) version that is designed to help inexperienced users may be downloaded from https://github.com/av-luebben/PDB2INSGUI or from the SHELX server. PDB2INS is available for 64-bit Linux, MacOSX and Windows systems. The GUI was written using TkInter (Tcl/Tk), the de facto standard graphical interface for Python (https://wiki.python.org/moin/TkInter).
Further information on using PDB2INS may be obtained by typing PDB2INS --help. After preparing the input files with the help of PDB2INS, SHELXL is usually run from the command line, e.g.
shelxl name
which reads the input files name.ins and name.hkl. This produces a listing file .lst and an updated instruction and structure file .res, and optionally a PDB-format file .pdb and a .fcf file containing observed and calculated structure factors. Coot (Emsley et al., 2010) may be used to inspect the results of the refinement. The .res file is copied to a .ins file for the next refinement job; often it will be necessary to edit it to include additional disorder components specified by PART numbers etc.
PDB2INS was tested on 23 974 data sets deposited between 2008 and 2018 in the PDB with a resolution of 1.7 Å or better. Only 4.0% (964 data sets) displayed any problems. Details are given in Table 1. For the remaining 96% the SHELXL refinement was successful without needing to make any changes to the .ins and .hkl files written by PDB2INS. Potential problems caused by insertion codes in the PDB file were avoided by renumbering the residues. In general the R factors obtained using SHELXL tend to be slightly higher than those obtained using Refmac (Murshudov et al., 2011) or Phenix (Adams et al., 2010), mainly as a result of the less sophisticated Babinet bulk solvent model (Moews & Kretsinger, 1975) still employed by SHELXL.
We are grateful to Tim Gruene, Paul Emsley and many other SHELX users for help in testing PDB2INS and suggesting improvements.
Adams, P. D., Afonine, P. V., Bunkóczi, G., Chen, V. B., Davis, I. W., Echols, N., Headd, J. J., Hung, L.-W., Kapral, G. J., Grosse-Kunstleve, R. W., McCoy, A. J., Moriarty, N. W., Oeffner, R., Read, R. J., Richardson, D. C., Richardson, J. S., Terwilliger, T. C. & Zwart, P. H. (2010). Acta Cryst. D66, 213–221.
Brünger, A. T. (1992). Nature, 355, 472–475.
Clabbers, M. T. B., Gruene, T., van Genderen, E. & Abrahams, J. P. (2019). Acta Cryst. A75, 82–93.
Cruickshank, D. W. J. (1969). Crystallographic Computing, edited by F. R. Ahmed, S. R. Hall & C. P. Huber, pp. 187–197. Copenhagen: Munksgaard.
Dutta, S., Burkhardt, K., Young, J., Swaminathan, G. J., Matsuura, T., Henrick, K., Nakamura, H. & Berman, H. M. (2009). Mol. Biotechnol. 42, 1–13.
Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. (2010). Acta Cryst. D66, 486–501.
Engh, R. A. & Huber, R. (1991). Acta Cryst. A47, 392–400.
Fischer, W. & Koch, E. (2005). International Tables for Crystallography, Vol. A, Space-Group Symmetry, 5th ed., edited by Th. Hahn, pp. 810–811. Dordrecht: Springer Netherlands.
Gruene, T., Hahn, H. W., Luebben, A. V., Meilleur, F. & Sheldrick, G. M. (2014). J. Appl. Cryst. 47, 462–466.
Grune, T. (2008). J. Appl. Cryst. 41, 217–218.
Joosten, R. P., Long, F., Murshudov, G. N. & Perrakis, A. (2014). IUCrJ, 1, 213–220.
Konnert, J. H. & Hendrickson, W. A. (1980). Acta Cryst. A36, 344–350.
Köpfer, D. A., Song, C., Gruene, T., Sheldrick, G. M., Zachariae, U. & de Groot, B. L. (2014). Science, 346, 352–355.
Luebben, J. & Gruene, T. (2015). Proc. Natl Acad. Sci. USA, 112, 8999–9003.
Moews, P. & Kretsinger, R. (1975). J. Mol. Biol. 91, 201–225.
Murshudov, G. N., Skubák, P., Lebedev, A. A., Pannu, N. S., Steiner, R. A., Nicholls, R. A., Winn, M. D., Long, F. & Vagin, A. A. (2011). Acta Cryst. D67, 355–367.
Nicholls, R. A., Long, F. & Murshudov, G. N. (2012). Acta Cryst. D68, 404–417.
Read, R. J., Adams, P. D., Arendall, W. B. III, Brunger, A. T., Emsley, P., Joosten, R. P., Kleywegt, G. J., Krissinel, E. B., Lütteke, T., Otwinowski, Z., Perrakis, A., Richardson, J. S., Sheffler, W. H., Smith, J. L., Tickle, I. J., Vriend, G. & Zwart, P. H. (2011). Structure, 19, 1395–1412.
Roy, S. C., Pratt, R. H. & Kissel, L. (1993). Radiat. Phys. Chem. 41, 725–738.
Sheldrick, G. M. (1993). Crystallographic Computing 6, edited by H. D. Flack, L. Párkányi & K. Simon, pp. 111–122. IUCr/Oxford University Press.
Sheldrick, G. M. (2015). Acta Cryst. C71, 3–8.
Sheldrick, G. M. & Schneider, T. R. (1997). Methods Enzymol. 277, 319–343.
Stenkamp, R. E. (2005). Acta Cryst. D61, 1599–1602.
Thorn, A., Dittrich, B. & Sheldrick, G. M. (2012). Acta Cryst. A68, 448–451.
Winn, M. D., Ballard, C. C., Cowtan, K. D., Dodson, E. J., Emsley, P., Evans, P. R., Keegan, R. M., Krissinel, E. B., Leslie, A. G. W., McCoy, A., McNicholas, S. J., Murshudov, G. N., Pannu, N. S., Potterton, E. A., Powell, H. R., Read, R. J., Vagin, A. & Wilson, K. S. (2011). Acta Cryst. D67, 235–242.
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.