Local and global analysis of macromolecular Atomic Displacement Parameters

This paper describes the global and local analyses of Atomic Displacement Parameters (ADP) of macromolecules solved and refined using X-ray crystallography method. It is shown that the distribution of ADPs follows the (mixture of) Shifted Inverse Gamma distribution(s). The parameters of the mixture of SIGDs are estimated using Expectation/Maximisation methods. In addition, a method for resolution and individual ADP dependent local analysis of neighbouring atoms has been designed. This method facilitates the detection of the mismodelled atoms and indicates potential identity of heavy metal atoms. It also helps in detecting of disordered and/or wrongly modelled ligands. Both global and local analyses can be used to detect errors in atomic structures thus helping in (re)building, refinement and validation of macromolecular structures. It can also serve as an additional validation tool during data deposition to the PDB. Synopsis Macromolecular atomic B value distributions have been modelled using a mixture of Shifted Inverse Gamma Distribution. Also, B value and resolution dependent local ADP differences have been applied for validation of heavy atoms and ligands.


Introduction
The ever-increasing number of macromolecular structures solved by crystallographic and cryoEM methods and deposited to the PDB (Berman et al, 2002;Lawson, 2011) requires statistically robust and automatic tools for refinement (Sheldrick, 2008 Vriend, 1990). Designing such tools for ADP validation is less intuitive and, although importance of this problem has been stressed by many authors (Rupp, 2009;Merritt, 2011;2012), currently there are no widely used tools to check and validate ADPs. One of the potential reasons is that ADPs reflect many such shortcomings of the modelling as crystal deficiencies (e.g. anisotropy, modulation, imperfection of crystals), inaccurate assumptions in data acquisition and processing, modelling problems (modelling the mobility of molecules using individual ADPs is essentially equivalent to the assumption that atoms are oscillating independently around their central position and such oscillation is harmonic and moreover, all unit cells behave exactly same way) and intrinsic mobility of atoms within molecules and molecules within crystals (Kuhs, 2006). There are several papers describing the use the ADP distribution as a validation criterion (Carugo, 1998 2. Some parts of the model (loops, ligands or even domains) may have been placed incorrectly. Essentially, such behaviour indicates that there is very weak or no evidence to support the presence of these parts of the structures, and as such they should be considered with extreme care.
If it is assumed that the noise level on the map is approximately constant over the unit cell, then it can be claimed that local signal to noise ratio depends on the height of the local average electron density and that in turns depends on the local mobility of molecules, therefore it can be expected that: 1) if atoms are placed in wrong positions, then during refinement their B values will increase dramatically to reflect absence of the density, as signal to noise ratio in these regions are close or equal to zero; 2) if two or more domains/subunits have different intermolecular and/or crystal contacts, then they will have different ADPs reflecting their mobility, thus, reducing signal to noise ratio and making interpretation of such regions very difficult. In both cases there will be multiple modes of ADP distribution and correspondingly, signal to noise ratio will be different. It means that, at least for some of the crystal structures, local signal to noise ratio and therefore local resolution will vary over the unit cell, being more or less constant (more precisely having the distribution corresponding to the ADP distribution) over the region covered by the atoms corresponding to the same mode.
In this work we model multimodal ADP distributions as a mixture of SIGDs which can potentially be used further to identify mismodelled and/or structurally compact regions. This fact, among a number of other odd behaviours of ADPs, has been shown by Rupp (2009) in his fine textbook on Biomacromolecular Crystallography.
Although modelling of overall ADP distribution is a good technique for identification of suspicious/interesting regions of crystal structures, it does not allow identification of individual mismodelled atoms, residues or ligands. To address this problem, we consider local ADP differences in a given crystal structure. In general, it is reasonable to assume that if two atoms are close to each other in space then their mobility and therefore their ADPs should be similar.
It makes sense if we consider molecules including waters as an elastic network; an oscillating atom has almost immediate effect on its surrounding. Moreover, if atoms have been modelled correctly then all factors influencing ADPs of an atom should also influence neighbouring atoms. Therefore, dramatic differences between ADPs of atoms close to each other in 3D space can mainly be due to different occupancies of the atoms and/or different atom identity, i.e. heavy atoms may have been modelled as light atoms or vice versa.
One of the problems is that the meaning of closeness of two ADP values is not entirely clear.
For example, difference between 100Å 2 and 150Å 2 can be less significant than difference between 10Å 2 and 15Å 2 . Moreover, the resolution will also affect the significance of these differences. Therefore, to analyse differences between B values of atoms the resolution as well as the ADPs need to be accounted for. Wang (2018) uses a similar idea to analyse the occupancies of atoms for different elements in crystals. Here this idea is used to calculate differences between ADPs as well as potential adjustment of occupancies to make ADPs of neighbouring atoms similar.
Organisation of the paper. First, mathematical formulation of the problems of modelling of ADP distribution using mixed SIGDs is described and then formulation for analyses of local differences is given. Then the described methods are applied for re-refined structures from the PDB and the results are analysed. The expectation and maximisation algorithm (EM) described by Bishop (2006) is used for the estimation of the parameters of the distribution defined in (1) and (2). Direct application of EM algorithm for the mixture of SIGDs turned out to be unstable. Therefore, the parameters were estimated in four steps: 1) Convert ADP distribution to Peak Height distribution (PHD); 2) use

Global ADP analysis
Silverman algorithm (Silverman, 1985) as implemented in scipy package to find the number and centroids of the modes; 3) using found number and initial centroids of the modes, fit the mixture of Gaussians into PHD; 4) Starting with the parameters found in the previous steps estimate the parameters of the mixture of SIGDs using EM algorithm (see Appendix A).

Peak Heights and local ADP analysis
For analysis of relative occupancies of neighbouring atoms peak heights of point atoms with a given resolution and ADPs are considered. In reality the noise level on the amplitudes and phases as well as the weights used in map calculations should also be accounted for. For simplification these factors are ignored. Peak height of a point atom is (Chapman, 1995): Where s max =1/d max is the maximum resolution, B mod is the ADP, erf is the error function (for survey of special functions see Abramovitz and Steugun, 1964). Masmaliyeva and Murshudov (2019) use this formula to demonstrate that there is a resolution dependent effect on PHD. If two atoms with ADPs equal to B 1 and B 2 are considered, then a question can be posed like this: how much the occupancy of the second atom should be changed so that the peak height becomes the same as a fully occupied first atom: It is solved for c to give: For point atoms putting the expressions from (3) results in: When ‫ݏ‬ ௫ ՜ ∞ this formula converges to: Note that the optimal occupancy value is achieved when the expression in (4) becomes zero, meaning that by changing the occupancies, peak heights at the centre of atoms could be changed arbitrarily. Possible minimum and maximum values of the estimated relative occupancies are c=0 and c=∞ which are achieved when B 2 =0 and B 1 =0 respectively.
Obviously, there is no physical meaning for the infinite relative occupancy; it is an artefact of using peak heights at the centre as a guide for atomic identity.
Since the most refinement programs try to fit the total density of atoms into the data, it might be better to use the differences between them to evaluate optimal occupancies. We would like to find the best occupancy for the second atom so that its total density is similar to the first atom: Using Parsewall's theorem (ignoring constants): For point atoms this can be written as: Note that when ‫ݏ‬ ௫ ՜ ∞ then this formula becomes: which could be used if the resolution is not known. However, it is recommended to use the resolution, if available, from the PDB entry and set that to the median resolution in all PDB entries (see Table 1) which seems to be around d max =2.1Å. Note that in this case the maximum estimated relative occupancy would be achieved when B 1 =0 which gives 83, meaning that, in general, this method will underestimate the occupancy of atoms/ligands/residues. The minimum of (6) is achieved when B 2 =0 which gives c=0.
No value of c can make the expression in (6) equal to zero unless B 1 =B 2 . It means that the only valid explanation of the density is using the correct atoms, which may never be possible.
Formulas (5) and/or (8)  space-group inconsistencies between the PDB and data files. We also excluded the cases with R factors > 0.3. Table 1 gives a short summary of the selection of PDB entries. Table 2 lists the examples of the PDB entries used in this work. It should be stressed that the aim of this contribution is not to criticize a particular PDB entry; rather, we would like to highlight the shortcomings of the techniques used at the time of elucidation of these structures and necessity of re-modelling and re-refinement as new technologies become available.

Results and Discussions
The examples below are aimed to demonstrate three aspects of ADPs: 1) modelling of multimodal distributions; 2) identification of mismodelled heavy/light atoms; 3) ligand validation.

Multimodal ADP distribution
The The ADPs correspond to the sum of two random variables -"true" ADP and errors in the estimation. As a result, under a naïve assumption that these two random variables are independent the observed distribution becomes the convolution of SIGD and Gaussian distribution leading to more symmetric distribution.
Estimation of multimodal ADP distributions shows that 13902 out of 90840 cases exhibit multimodality, most of them are bimodal. Because of the reasons given above the second and higher modes are more symmetrical. There are only 266 PDB entries for which ADP distributions show three modes. One such example is 5TU8 (Figure 2). The Figure 2 shows  Table 3.
In case 4RQZ entry there are two well defined modes (Figure 3a,b). The molecule has three domains, two of which make contacts with each other and their symmetry mates, these domains are responsible in the crystal formation. The third domain makes contacts only with its symmetry copy (Figure 3c,d). This domain together with its symmetry mate is free to move without further contacts hindering such movements.
Parameters of the mixture of B value distributions for 4RQZ are given in the Table 4. As it can be expected the density for the domains corresponding to the second mode is weaker than that for the first two modes (Figure 3).

Local ADP analysis
The algorithm described in the Appendix B was applied to all PDB entries considered and more than 1900 entries with data resolution 2Å and better were analysed manually. More than 600 entries were identified as potentially containing heavy atoms and their densities were studied carefully. The electron density corresponding to the atoms marked as light atoms are weaker and in many cases these atoms are exposed to solvent. As a result, in many cases the exposed atoms have high ADPs than surrounding atoms. Residues containing these atoms could have multiple conformation and they might have been subjected to radiation damage. Analysis of radiation damage is outside of the scope of this work.
One example of potentially lighter than surrounding atoms (CD1 of 131 residue ILE of chain A) in the pdb 2WXU is given in Figure 4. The calculated optimal occupancy is 0.63. The ADP of this atom is 36.94 whereas median of the ADPs of surrounding atoms is 19.83. Figure 4b shows that this residue has been modelled in a wrong rotamer. After correction using the It is likely that some metals are modelled as water by automatic water picking software as they usually do not analyse the interaction with the surrounding atoms and the height of the electron density. They usually look for the existence of the difference density. The several such cases have been identified in the examined pdb entries. Figure 5  calculations the resolution of the data must be sufficiently high and atoms must be well defined. Figure 6 shows this atom together with its symmetry mate and its coordination together with the map.

Application of local ADP analysis to ligand validation
Local analysis was also applied for ligands. In this case all ligand atoms where considered and median ADP of ligand was compared to that of neighbouring atoms. There were number of cases where ligands were marked as having substantially less than full occupancy.
There were number of SO 4 and PO 4 groups that do not seem to have supporting experimental evidence. We did not consider these cases further. There are number of Zn and other metals with suspected density, since these cases are considered by Touw et. al. (2016) we did not analyse them further. More than ten PDB entries were inspected in detail. Among them three of them were selected for this work. These are 5X1O with ligand I3P -Inositol 1,4,5trisphosphate, 5ORJ with ligand I6P -Inositol 1,2,3,4,5,6-hexakiphosphate and 6B9B with the ligand MAL -Maltose. Table 6 gives the relative estimated occupancies for these ligands together with the median ADPs of the ligands and surrounding atoms. is absent or present with low occupancy (Figure 8a). Its median ADP is 300.01 and that of surrounding atoms is 52.15. Inspection of the density and symmetry related molecules showed that this ligand is on a two-fold-axis resulting of non-bonding repulsions of symmetry related molecules moving them out of the density. After 40 cycles of refinement with half occupancy the median ADP of the ligand became 84.84 with that of the neighbours equal to 51.77 and estimated occupancy became 0.66. The difference density became clean (Figure 8 b), although the density is still weak. It suggests that although half occupied ligand fits better there is still some disorder or mobility of this ligand. It is also a clear demonstration that during model building and refinement crystal symmetries must be accounted for.

Case 3: 6B9B
Occupancy of the ligand MAL of B chain of this PDB is estimated to be 0.414. Electron density shows there is no convincing evidence that fully occupied ligand exists in the crystal (Figure 9a). Refining the model without this ligand and adding waters according to the difference maps again cleaned up the density (Figure 9b).
All these examples show that comparative analysis of ADPs of ligands with their neighbours can play a role in validation and has a potential in identifying wrong or disordered ligands. In future the use of the SIGD and local B value discrepancies in designing ADP restraints will be explored. It seems that using resolution dependent ADP restraints may result in better refinement. It could also be applied for atoms sufficiently far from each other.