Likelihood-based estimation of substructure content from single-wavelength anomalous diffraction (SAD) intensity data

An intensity-based likelihood method is provided to estimate scattering from an anomalous substructure considering the effect of measurement errors in Bijvoet pairs and the correlations between these errors.


Introduction
The anomalous differences between Bijvoet pairs of reflections can be exploited for phasing in crystallography. However, the anomalous differences in intensities are generally limited to a few percent in size, and special care needs to be taken in planning the experiment and in collecting and processing the data in order to measure such differences with sufficient precision for successful phasing (Terwilliger et al., 2016b). Planning the experiment benefits from estimating the achievable anomalous difference, considering the number of anomalous scatterer sites that might be present and the precision with which the intensities are measured (Terwilliger et al., 2016a).
Both SHELXD (Schneider & Sheldrick, 2002) and AutoSol (Terwilliger et al., 2009), the experimental phasing suite in Phenix (Liebschner et al., 2019), require a prior estimate of how many anomalous scatterers are expected in the substructure. The most accurate estimates are obtained when there is a known stoichiometry for an intrinsically bound metal, so that the size of the substructure depends only on the number of copies in the asymmetric unit. For soaking experiments with heavy metals or halides, initial estimates of the number of sites depend on rules of thumb that are typically ISSN 2059-7983 based on the number of residues. Even when phasing with intrinsic scatterers such as S atoms in native proteins or with Se atoms in proteins incorporating selenomethionine (SeMet), parts of the chain may be disordered, selenium substitution may be incomplete or radiation damage could reduce their occupancy by the end of the X-ray diffraction data collection.
This work looks at characterizing the data after the experiment has been performed and the data have been processed. Specifically, we are addressing the problem of determining, from the data, the amount of scattering contributed by the anomalous substructure. This provides both an estimate of the size of the actual anomalous differences between Bijvoet pairs and information about the number of sites that is expected in the substructure. The underlying approach is to devise a likelihood target that can be used to determine parameters that quantify the strength of anomalous scattering, considering the effect of errors in measuring intensities of Bijvoet pairs and also the effect of correlations in these errors.
The derivation of the likelihood target starts with understanding how the strength of anomalous scattering affects the sizes of the differences between the true Bijvoet mates, represented through their joint probability distribution. This is developed in Section 2, along with the impact of intensitymeasurement errors on the distributions of measured intensities. The manipulation of probability distributions of acentric structure factors is much more straightforward with complex sources of error, so the LLGI approximation is introduced, allowing the effects of scalar errors in intensities to be modelled very well with complex errors.
The likelihood target itself is defined in Section 3, expressed in terms of parameters that depend on the total strength of scattering from the substructure of major anomalous scatterers and other parameters describing the degree to which measurement errors in Bijvoet pairs are correlated. The approximations underlying the likelihood target are validated by showing that they agree very well with exact relationships evaluated by (expensive) numerical integration calculations.
Section 4 develops methods to interpret the adjustable parameters from the likelihood target. It is assumed here that one has knowledge of the strongest anomalous scatterer contained within the crystal, the likely content of the crystal (i.e. the number of copies of protein or nucleic acid sequences expected to be found in the asymmetric unit), the wavelength of data collection and the associated scattering factors for anomalous scatterers. Building on this, it is possible to estimate the equivalent number of fully occupied anomalous scatterers, along with their overall B factor, that would be required to explain the differences between Bijvoet mates.
Section 5 describes the collection and curation of a large set of test data and the design of the calculations using these data. Finally, the results of these calculations are outlined in Section 6, evaluating the extent to which the actual substructure content can be predicted from the measured data before a substructure has been determined and refined.
The parameters that are estimated to characterize the SAD intensity data are also required to refine substructure models and obtain phase-probability distributions. This will be investigated in future work, along with ways to assess anomalous signal through measures of information gain and estimates of the log-likelihood-gain score that would be achieved with an ideal substructure model.

Intensity-based joint probability distributions for SAD data
To derive probability distributions for measured diffraction data for use in crystallographic likelihood functions, it is necessary to combine the effects of complex differences in the structure factors with those of scalar measurement errors in the intensities. This is further complicated by the fact that the amplitude of the structure factor is related to the square root of the intensity; the true intensity is never negative, but the measured intensity may well be. We have not found a way to derive exact analytical expressions combining these differences. Nonetheless, in our previous work on the LLGI intensity-based likelihood target (Read & McCoy, 2016), we showed that a log-likelihood-gain score that accounts exactly for the effect of Gaussian measurement errors on intensities can be approximated extremely well with a target computed via the Rice function (for the acentric case), in which the intensity and its standard deviation are transformed into an effective amplitude and a Luzzati-style weighting term approximating the effect of the scalar measurement error as an error in the complex plane. Importantly, the effective amplitude and the weighting term are independent of calculated structure factors from a model, so they only need to be determined once. Here, we investigate whether the same approach can be extended to intensity-based iSAD likelihood targets for SAD data, in which there is a pair of correlated intensity measurements for each set of Miller indices. We concentrate on what can be deduced about the joint distribution of the true Bijvoet mates from the corresponding intensity measurements and what this can tell us about the scattering power of the anomalous substructure.
In the following, we make a number of simplifying assumptions.
Firstly, we assume that the intensities (or Bijvoet pairs of intensities) measured for different Miller indices are independent of each other. This is not strictly true, but the correlations arising from the presence of bulk solvent or the existence of noncrystallographic symmetry are much weaker than the strong correlations between Bijvoet pairs. Secondly, we assume that the phase angles of the individual atomic contributions to structure factors are independent, so that the total structure factors can be considered to arise from a random walk in the complex plane, leading to a complex normal distribution. For this to be true, it is sufficient for the atoms to be randomly distributed in their distances to the Bragg planes associated with any particular reflection; apart from the lowest resolution reflections, it is not necessary to make the much more restrictive assumption that the atoms are randomly distributed throughout the unit cell.
Thirdly, we further assume that this independence extends to the substructure of anomalous scatterers, so that the joint distribution of Bijvoet pairs follows a multivariate complex normal distribution.
Fourthly, we assume that intensity-measurement errors are drawn from a Gaussian distribution, which is independent for reflections with different Miller indices, although there may be a correlation in measurement errors for Bijvoet pairs.

Joint prior distribution of true Bijvoet mates
To set the stage for characterizing the substructure content, we start by defining the joint distribution of true Bijvoet mates (with no measurement error) in terms of the atomic content of the crystal, divided into the most significant anomalous scatterer (for which a substructure might be determined during the process of phasing) and the rest of the atoms. Note that we are not assuming here that the rest of the atoms lack any anomalous scattering contribution. For instance, in SeMet phasing the S atoms in cysteine residues will make a small but non-negligible contribution to the anomalous differences, even though it is only rarely possible to identify the positions of these atoms during substructure determination.
The Bijvoet mates are described in terms of F + and the complex conjugate of F À , F ÀÃ , because these are highly correlated and have similar phase angles. Individual elements differ in their (in general) complex scattering factor f j , and each atom will differ in its position x j , occupancy o j and displacement parameter B j , as shown in (1a) and (1b).
In these equations, h is the vector of Miller indices and s is the corresponding diffraction vector, the magnitude of which is the inverse of the interplanar spacing. As discussed in our earlier work on SAD phasing (McCoy et al., 2004), the joint distribution of Bijvoet mates takes the form of a multivariate complex normal distribution, which is readily derived by assuming that each atom contributes independently to the total structure factors and considering the atomic parameters to be random variables. [The effects of correlations between atomic contributions arising from translational noncrystallographic symmetry (tNCS) can be addressed by modifying the expected intensity factors in the final equations, as described previously for the case of normal scattering (Read et al., 2013).]. Equation (2a) defines the prior joint distribution (before a substructure model is available), where the expected values of the complex Bijvoet mates are zero in the absence of any prior structural knowledge and the Hermitian covariance matrix is defined in (2b). where The diagonal variance term, AE N , is simply the scattering power of the crystal defined in terms of the scattering factors in (3a), while the off-diagonal covariance element, FF , is defined in (3b) and " is the expected intensity factor arising from the statistical effects of crystal symmetry. The superscript H denotes the Hermitian transpose, i.e. the transpose of the complex conjugates.
These structure factors are the sums of atomic contributions for N atoms, which will be divided below into the H atoms that could be identified as an anomalous substructure (generally a single primary anomalous scatterer type) and the remaining background (B) atoms that have relatively little anomalous scattering. Depending on the context, intrinsic anomalous scatterers, such as S atoms in cysteine and methionine residues, could either comprise the H atoms or be considered to be part of the B atoms if there is a stronger anomalous scatterer in the crystal. In both cases, the sums can be taken separately over the B and H subsets.
The scattering factor can be expressed as f j = (f 0 + f 0 ) + if 00 , which is a function of both wavelength and resolution. Note that the wavelength-dependent correction terms f 0 and f 00 are essentially independent of resolution as they arise from innershell electrons that can be considered to be point scatterers at the relevant resolutions. The wavelength-independent form factor f 0 provides the resolution dependence. For (3a) and (3b), we can expand the scattering factor terms to obtain The structure factors can be normalized to give E-values with a mean-square value of 1 by dividing them by the square root of their expected intensities, "AE N . In the joint distribution of E-values, there is just a single complex correlation parameter, FF : where AE ¼ 1 FF Ã FF 1 ð4bÞ research papers Because the Bijvoet pairs are highly correlated, values of FF in practice have magnitudes of only slightly less than one. The deviation from one tends to increase with resolution, because f 00 is effectively independent of resolution, whereas the real parts of the scattering factors decrease with resolution.

Correlated measurement errors in measured Bijvoet mates
In the LLGI approach to accounting for the effect of measurement error, the intensity and its standard deviation are transformed into an effective amplitude F e (or E e for normalized data) and a Luzzati-style weighting factor D O that, together in a Rice probability function, give an excellent approximation to the posterior probability of the true amplitude given the intensity. In the related iSAD approach to an intensity-based likelihood function proposed here, both members of the Bijvoet pair are transformed in the same way.
As demonstrated below, this approach is well justified when the measurement errors in the Bijvoet mates are uncorrelated, but requires some elaboration when they are correlated. As discussed by Garcia-Bonete & Katona (2019), time-dependent effects on the measured intensities, such as radiation damage, can lead to correlations between the errors of mean intensity measurements, and there is evidence of such correlations in some of the data sets that we have examined (discussed below). Correlations in measurement errors can be accounted for by assuming that the errors are drawn from a bivariate normal distribution in which the individual variances are obtained from the data-processing analysis but in which a nonzero correlation is present. For simplicity of notation, we use Z to represent the square of an E-value (or, equivalently, a normalized intensity). A joint probability distribution for the effect of correlated measurement errors on the observed normalized intensities is given in (5).
where Z + and Z À are the true values of the normalized intensities, Z þ O and Z À O are their respective observed values, Z þ O and Z À O are the respective standard deviations of the measurements and AE is the correlation coefficient between the measurement errors.
It seems reasonable to conjecture that the effect of this correlation on the iSAD approximation can be modelled by assuming that the implied complex errors in the structure factors are correlated to the same degree as the real errors in the corresponding measured intensities. In the iSAD approximation (as in the LLGI approximation), the effective normalized amplitude arises from a complex structure factor that is obtained by adding a complex normal error to the down-weighted true structure factor, as given in (6).
where hjD þ j In this approximation, D þ O plays the role of a complex correlation between the true E + and the phased effective amplitude E þ e . Note that, because of the D þ O weight on E + , the expected value of ðE þ e Þ 2 is one. Equivalent expressions apply to the Bijvoet mate. The assumption that the complex errors are correlated to each other, with a complex correlation coefficient that has a magnitude equal to AE , allows us to determine the complex correlation between E þ e and E ÀÃ e , defined as FF,obs , by analogy to the complex correlation FF between the corresponding true values E + and E À . This is shown in (7), where we assume that the complex errors are uncorrelated with the true weighted structure factors, so that cross-terms such as For simplicity (also justified by the consideration that the implied complex error is effectively modelling the error in the amplitude, i.e. the error parallel to the structure factor), we will assume that AE (and thus FF,obs ) has the same phase as FF . In any event, in the situations considered here only the absolute value of FF,obs influences the outcome, although the phase of the complex correlation will influence phasing calculations when a substructure model is considered in future work.
By analogy to (4), the joint distribution of the phased effective amplitudes is defined in (8).
where AE ¼ 1 FF;obs FF;obs 1 : 3. The data likelihood target: joint distribution of effective amplitudes The probability distribution in (8) relates structure factors, but the measured data are intensities with unknown phases, which have been transformed into the effective amplitudes in this equation. The phases in (8) can be integrated out to obtain a likelihood function that depends only on the effective amplitudes, given in (9).

research papers
Note that there is only a single parameter to describe the variance of this distribution, FF,obs . However, FF,obs is itself a function of D þ O and D À O , which are fixed values obtained in the calculation of the effective amplitudes, and of the adjustable parameters FF and AE . As discussed above, FF,obs can be treated as a scalar (as well as the underlying FF ) in this context, because any phase component has no effect on the likelihood in the absence of a substructure model. This likelihood function, which is the main focus of this work, can be used for two purposes. Firstly, the adjustable variance parameters can be refined to characterize the data in terms of the strength of the anomalous scattering (| FF |) and potentially the degree to which the measurement errors are correlated ( AE ), if this parameter is not available from an analysis during the merging step of data processing. Secondly, it provides the likelihood score for a null hypothesis in phasing, i.e. the baseline for a log-likelihood gain (LLG) when a substructure model is available. In other words, it can play an equivalent role to the Wilson distribution (Wilson, 1949) in the LLG used for purely real scattering in molecular replacement or refinement. Here, we will explore the uses of this likelihood function to characterize the data, particularly to estimate the substructure content.

Validation of the iSAD approximation
To verify that it is appropriate, firstly to construct the iSAD approximation by transforming the observed intensities independently into effective amplitudes and D O factors and secondly to assume that the same correlation parameter AE can be used to model the effect of correlated measurement error, we have followed the approach used in validating the LLGI target (Read & McCoy, 2016) by comparing the conditional probabilities of the true amplitudes given the observations, obtained either with the exact treatment or with the iSAD approximation.
The gold standard for the comparison is the joint conditional probability distribution for the true amplitudes given the observed normalized inten- , derived by following the propagation of errors and using numerical integration, giving (19) in Appendix A. The corresponding joint conditional distribution, given the effective amplitudes from the iSAD , is provided as (23) in Appendix B. Comparison of exact and approximate probability distributions for the true normalized amplitudes conditional on the observed intensities. (a) Contour plot illustrating pðE þ ; calculation modelled on SeMet phasing where both the intrinsic anomalous signal and the measurement error are significant. In one case the measurement errors are assumed to be independent, whereas in the second case the errors are assumed to be highly correlated, with AE = 0.75. The exact distribution and the iSAD approximation are indeed very similar in both cases, while the introduction of correlated errors has a profound effect on the distributions. Similar results were obtained in other calculations where the level of anomalous signal, the measurement error and the correlation of measurement error have been varied (not shown), justifying the use of this approach.

Maximum-likelihood estimation of substructure content
When phasing with intrinsic anomalous scatterers, such as Se atoms in SeMet constructs or S atoms in native proteins, one has reasonable prior knowledge of the atomic composition of the crystal. Even in this favourable case, there is uncertainty about the degree to which the expected sites are ordered and potential uncertainty about the occupancy of Se sites because of variable SeMet incorporation. When soaking with heavyatom compounds, halides or other derivatives, only a rough guess can be made in advance about the degree of substitution. Refinement of the variance parameters in a log-likelihood function based on (9) should enable a reduction of the uncertainty in the substructure content relative to other atoms in the crystal. This will be useful in characterizing the phasing signal as well as in judging the difficulty of substructure determination.
There is a direct relationship between | FF | and relative substructure content if we treat the scattering power of only one primary type of anomalous scatterer as unknown. The anomalous scatterer content can be placed on an absolute scale if the number of copies of the protein in the crystal can be deduced from the Matthews volume (Matthews, 1968). Equation (10) is a simple consequence of (3) and (4), given that the primary anomalous scatterer (H) atoms share the same scattering factor, denoted f H here. where In (10a) the overall Wilson B factor (B W ) has been factored out of the primary anomalous scatterer contributions, leaving the individual atomic differences (ÁB j ) in (10b). For the substructure content analysis, these equations are simplified by factoring out the overall Wilson B factor from all sums, approximating the B (other background) atoms as sharing the same overall B factor, and approximating the H atoms as sharing the same ÁB H relative to the B factor of the B atoms. These approximations give (11).
In (11d), n H is the equivalent number of fully occupied atoms with the same total scattering power as the substructure, (which is weighted by the sum of occupancies squared); this is not necessarily and indeed is not usually an integer.
To convert | FF | for a resolution shell to a value of x H , (11a) is solved for x H by transforming it into a quadratic expression in x H , shown in (12). where There are in general two solutions to the quadratic, as illustrated in Fig. 2. In the current implementation, the solution corresponding to a smaller substructure is chosen, although if a prior probability distribution for the substructure size were provided the two solutions could be assigned relative posterior probabilities. Complex correlation as a function of substructure composition. The calculated magnitude of the complex correlation, | FF |, is shown in blue as a function of the assumed number of Se atoms in the asymmetric unit (computed against a background of 1000 C, N or O atoms and ten S atoms). Intersections with the horizontal orange line illustrate that a refined value of 0.995 for | FF | is consistent with either about 11 Se atoms or 370 Se atoms. The minimum value of | FF | consistent with the assumed background composition and nature of the primary anomalous scatterer is about 0.990.
One approach that has been tested is to use the resulting x H values for resolution bins to estimate values of n H and ÁB H by transforming (11d) into (13) and fitting a least-squares line.
However, we have found that a slightly better stability is obtained with an alternative approach. The target function given in (14) is minimized, starting from a grid search varying n H and ÁB H over a range of values consistent with the x H estimates obtained from the refined | FF | values.
The first term in T restrains ÁB H to zero, with a standard deviation typically set to 5 Å 2 . The factor , which is typically set to 0.1, controls the damping of the robust Geman-McClure loss function comprising the second term. The calculated values of | FF | are computed using (11). The standard deviations for | FF | values are obtained from the inverse of the second-derivative (Hessian) matrix of the likelihood target computed for the optimized parameters.

Strategy for the refinement of variance parameters
Refinement of the | FF | and AE parameters to maximize the likelihood function based on (9) is implemented in the SCA (substructure content analysis) mode of Phasertng, which is under development (McCoy et al., 2021). In the current implementation these parameters are refined in resolution bins, with a minimum of 500 reflections per bin. Two refinement macrocycles are carried out. In both macrocycles the bin values for AE are constrained to lie in the range 0-0.9, with a weak quadratic restraint towards the value of 0 (standard deviation of 0.5) so that error correlations are inferred only when required to explain the data. In addition, a quadratic smoothness restraint penalizes AE values that differ from the value computed from the line connecting the two nearest neighbours (standard deviation of 0.025). This is similar to an approach used to stabilize the refinement of A values for maximum-likelihood refinement when they are evaluated using just the cross-validation data (Pannu & Read, 1996). In the first macrocycle, the bin values for | FF | are constrained to lie in the range 0-1 while being otherwise unrestrained. At the end of this macrocycle, values of n H and ÁB H are estimated from the bin values for | FF | as discussed above. Some values of | FF | are too low to be achieved with any value of x H for a given anomalous scatterer, as shown in Fig. 2. Resolution bins violating this constraint are ignored in the determination of n H and ÁB H , and their values for | FF | are reset to those computed from the values of n H and ÁB H estimated from all of the data. This situation generally arises near the resolution limit, when the anomalous signal is very small relative to the noise.
For the second macrocycle, the estimated values of n H and ÁB H are used to determine target values for x H , and thus | FF |, for each resolution shell. Loose restraints for | FF | are applied to smooth the curve as a function of resolution, with the standard deviation being determined by the change in | FF | that would change x H by a factor of 1.5. This can stabilize refinement in cases with weak signal to noise, but has relatively little effect in most cases. In addition, | FF | in each bin is constrained in this macrocycle to lie between the minimum that can be achieved with any value for x H and the maximum possible value, corresponding to x H = 0.

Collecting and curating test data
The method to determine substructure content was tested on a database of SAD data sets provided by collaborators or downloaded from the Worldwide Protein Data Bank (wwPDB; Berman et al., 2000). 124 data sets were kindly provided by Zbigniew Dauter, most of which have been discussed earlier (Banumathi et al., 2004;Dauter et al., 2002;Wang et al., 2006). 162 data sets (which include MAD data sets split into individual wavelengths and considered as SAD data sets) were collated by Tom Terwilliger from JCSG experiments and have been discussed earlier (Bunkó czi et al., 2015).
The majority of the data sets in the database were downloaded directly from the wwPDB. The advanced search option of the RCSB PDB was used to perform queries. A list of PDB entries was collected which had a 'Structure Determination Method' record containing the word 'SAD' and a 'Citation' record, and for which experimental data including Bijvoet pairs had been deposited. Data extending to poorer than 4 Å resolution and structures possessing tNCS were excluded. This list was split into three categories.
(i) Soaking experiments, comprising structures determined with any halides, heavy metals, noble gases or other elements from derivatives commonly used in phasing experiments.
(ii) SeMet experiments, comprising structures containing Se atoms (in order for these not to dominate the database SeMet structures were restricted to entries deposited after 1 January 2018).
(iii) Sulfur SAD phasing experiments, which were identified by examining PDB entries that provide Bijvoet pairs but do not contain any atoms heavier than S.
For each entry, the Phenix package phenix.fetch_pdb command with the argument --mtz was used to download the model, sequence and structure factors, and to convert structure factors from cif to MTZ file format. The values for wavelength, cell dimensions, resolution and space group were verified with the associated publications, and any inconsistent data were removed from the list. Each data set was associated with the element type expected to contribute most strongly to the anomalous signal, denoted the primary anomalous scatterer. A total of 536 data sets were selected initially. We were surprised to note that none of these are affected by twinning, an observation that highlights the difficulty that current phasing methods have with such data.

research papers
The data sets were screened for the presence of at least minimal anomalous signal during the initial step to generate reference substructures using the MR-SAD protocol (discussed below). Several data sets had so little anomalous signal that no anomalous scatterers could be detected. A significant number of other data sets had such poor anomalous signal that only a small fraction of atoms in the substructure were placed correctly. These data sets were omitted from the subsequent analysis, leaving 382 of the original 536. It seems likely that many of these data sets are in fact native data for structures that were solved by SAD phasing using separate data that were not deposited. For a small additional number of data sets that were omitted, the reported wavelength was incompatible with the strength of the anomalous signal. This left 357 data sets in the curated database.

Generating reference substructures
Reference substructures were generated using the MR-SAD protocol available in Phaser (Read & McCoy, 2011). To be consistent in the use of structure-factor amplitudes (needed for the current version of Phaser, which does not work with intensity data), deposited intensity data (whenever available) were converted to structure-factor amplitudes (|F|) and their estimated standard deviations for the MR-SAD step using the phenix.french_wilson tool (Liebschner et al., 2019). For data sets with only deposited structure-factor amplitudes, these were converted to approximate intensity measurements as described for the LLGI target (Read & McCoy, 2016) for the substructure content analysis step. In the MR-SAD protocol, the deposited atomic model of the protein is used as a starting model for phasing, but is treated as being composed of purely real scatterers. In the approach used here, anomalously scattering centres were found using SAD log-likelihood-gradient maps (McCoy & Read, 2010) to search for purely imaginary scatterers, since the real scattering at each centre was already accounted for in the deposited model used for phasing.
Purely imaginary scatterers found in the MR-SAD step were replaced with the atom type of the corresponding atom in the deposited structure to give the anomalous substructure, annotated using phenix.emma (Grosse-Kunstleve & Adams, 2003) to identify atoms that superimpose within a distance threshold. The parameters of the anomalous substructure were then refined against the data, without altering the substructure with log-likelihood-gradient completion (Read & McCoy, 2011). The refined f 00 for the primary anomalous scatterer and, for each anomalous scatterer type identified, the number of sites and the sum of the squared occupancies of sites, were stored in the database. The total scattering power of the anomalous substructure was evaluated in terms of the equivalent number of fully occupied primary anomalous scatterers, which was calculated as the sum of squared occupancies for each atom type weighted by the square of the ratio of the f 00 for that anomalous scatterer type and the f 00 of the primary anomalous scatterer. This approximation assumes that the contribution of any secondary anomalous scatterers, if present, is dominated by their imaginary contribution, and that differences among atom types in the ratio of real to imaginary scattering are less important. The quality of SAD phasing was assessed by computing the correlation between the experimentally phased map and density generated from the deposited model using phenix.get_cc_mtz_pdb. Many of the test data sets have been measured at a wavelength near the absorption edge of the primary anomalous scatterer, where the f 00 changes rapidly. For these data sets, the f 00 for the primary scatterer was refined as part of substructure refinement and phasing. Values of f 00 obtained from table lookup can have significant errors: the tabulated values do not account for the effects of the chemical environment (Evans & Pettifer, 2001) and the wavelength may not be known precisely because of errors in monochromator calibration (Ruslan Sanishvili, personal communication). It is best to obtain prior estimates of f 00 from a fluorescence scan of the crystal at the beamline (Evans & Pettifer, 2001), but in this study we do not have access to fluorescence-scan data for the test data sets. For the data sets collected near the absorption edge, we have therefore used the refined f 00 values for the primary anomalous scatterer obtained during refinement and phasing with the reference substructure. We expect the refined f 00 to be a better estimate of the true f 00 than the value from a table lookup, but there will be random errors. In the refinement, the f 00 value for an anomalous scatterer type and the overall occupancies of the individual atoms will be correlated, with both changing the imaginary terms in calculated structure factors but differing in how they affect the relative contributions of the real and imaginary terms as a function of resolution; how well these effects are decoupled will depend on the precision of the data. There may also be systematic errors. For instance, if there is a mixed substructure and some atom types are incorrectly identified, the refined f 00 values will reflect a compromise between the relative real and imaginary scattering of the different atom types.

Preparation of data for substructure content analysis (SCA)
The diffraction data were processed using the Phasertng.xtricorder module of Phasertng (McCoy et al., 2021). Phasertng.xtricorder carries out a series of data analyses to detect and correct for the statistical effects of anisotropy, tNCS and twinning, although none of the data included in this study were affected by either tNCS or twinning. The data were scaled and used for maximum-likelihood estimation of substructure content, which is carried out within Phasertng. xtricorder when the data include Bijvoet pairs. The known protein composition of the crystal was used when scaling the data; if an incorrect composition were used, the intensity scaling and therefore the estimated anomalous scatterer content would change proportionally.

Analysis of the effect of radiation damage
To test the hypothesis that positive correlations between the measurement errors for members of a Bijvoet pair can arise from the effects of radiation damage, we searched the Integrated Resource for Reproducibility in Macromolecular Crystallography (IRRMC; Grabowski et al., 2016; http:// proteindiffraction.org/) to find a data set with the keyword 'SAD', strong anomalous signal and high redundancy so that subsets of the full data could be analysed. The search yielded the data for PDB entry 3ot2 (Joint Center for Structural Genomics, unpublished work) with accession identifier https:// doi.org/10.18430/M33OT2. The data set comprises 360 images, which were integrated using XDS (Kabsch, 2010) from XDSGUI (https://strucbio.biologie.uni-konstanz.de/xdswiki/ index.php/XDSGUI). Subsets of the integrated data were scaled and merged in the same package before comparing the values obtained for the error-correlation parameter as a function of resolution.
All calculations were performed on a Dell Precision 5820 machine with 128 GB RAM and an Intel Xeon W-2145 CPU @ 3.7 GHz Â 16, running the CentOS version 7 operating system.

Overview of the curated database
The curated database consisted of 357 data sets for crystals representing a total of 23 different anomalous scatterers (Table 1). In 22 cases, a mixture of anomalous scatterer types contribute strongly (with secondary anomalous scatterers contributing up to 50% of the total anomalous scattering). The space-group sampling of the database is similar to the spacegroup sampling of the wwPDB (Wukovitz & Yeates, 1995). Of the 357 data sets, 251 had intensity data deposited, while the rest had structure-factor data alone. Fig. 3 shows distributions for a number of characteristics of the data. The resolution of the data sets ranges from 0.93 to 3.6 Å , with the total anomalous scattering ranging from the equivalent of 0.05 to 134 fully occupied atoms. The database included data collected across a range of wavelengths from 0.81 to 2.29 Å ; the largest peak in the distribution of wavelengths includes 143 Se-SAD data sets collected near the Se absorption edge at about 0.98 Å . There are three other notable peaks in the wavelength distribution: one near 0.9 Å , largely corresponding to highenergy remote Se data, one near 1.3 Å , corresponding to the Zn absorption edge, and one at 1.5418 Å , corresponding to Cu K home X-ray sources. The map-to-model correlations range from values of around 0.2 to up to 0.8 for data sets with very high anomalous signal.

Estimation of the total anomalous scattering
The SCA mode estimates the scattering power of the anomalous substructure, measured in terms of the equivalent number of fully occupied primary anomalous scatterers, as discussed in Section 3. The estimated number correlates well with the total anomalous scattering determined from the reference substructure, with a log-log correlation coefficient of 0.72 for data deposited as intensities (Fig. 4). (Supplementary Fig. S1 shows an equivalent plot also including data deposited as amplitudes; the correlation coefficient is still 0.72 but there are more outliers, likely reflecting the difficulty in reversing the transformation from intensities to amplitudes.) The estimates are also consistent across different element types (see Supplementary Fig. S2). However, the total anomalous scattering tends to be slightly underestimated (or, alternatively, the refined occupancies could be slightly overestimated).

Effects of radiation damage
PDB entry 3ot2 belongs to the cubic space group P23, and the diffraction data deposited in the IRRMC comprise 360 images with 0.5 oscillation per image, giving a total of 180 of data. With the high symmetry, there is greater than tenfold average redundancy for each observation of the plus or minus hand of the Bijvoet pairs. To confirm the presence of radiation damage during data collection, a model-phased difference Fourier was computed, comparing the data processed from the first 90 images with those from the last 90 images. The strongest peaks in the resulting map reveal the decarboxylation of a number of acidic side chains, a cluster of which are shown in Fig. 5.
The diffraction data were reprocessed to include four progressively wider ranges of radiation dose, including the first 90, 180 or 270 or all 360 images. The substructure content analysis was carried out for each merged data set, comparing the values of AE obtained in each analysis. As expected from the hypothesis that a correlation of errors between Bijvoet mates can arise from merging data suffering from different degrees of radiation damage, the values of AE increase with both resolution and total radiation dose (Fig. 6). The overall mean values of AE are 0.086 for data from the first 90 images, 0.122 for the first 180 images, 0.149 for the first 270 images and 0.160 for all 360 images.

Discussion
In SAD phasing based on structure-factor amplitudes, the difficulty of reliably extracting the anomalous signal from the  Table 1 Number of entries for each of the anomalous scatterers present in the database.
The total number of entries is greater than 357 as some data sets are counted multiple times when they contain more than one type of anomalous scatterer. noise introduced by intensity-measurement errors is further complicated by difficulties in converting intensity errors into amplitude errors. Our experiences with accounting for the effect of intensity-measurement errors in molecular replacement (Read & McCoy, 2016) suggested that the effects of scalar errors in intensity measurements could be approximated well as complex errors in structure factors, transforming the intensity data into effective amplitudes (F þ e and F À e ) and Luzzati-type weighting parameters (D þ O and D À O ). Numerical tests showed that the joint distribution of the true amplitudes in the Bijvoet pair, given the observed intensities, was approximated extremely well by this treatment when the measurement errors in the Bijvoet pair are independent. However, the results of preliminary test calculations suggested that in fact measurement errors are positively correlated. A measurement error-correlation parameter, AE , was introduced and further numerical tests showed that the joint distribution of the true amplitudes could still be approximated extremely well, even with strongly correlated measurement errors. This error treatment, therefore, will underlie our continuing work on an intensity-based SAD likelihood target, termed iSAD, which should strengthen the use of SAD data sets with marginal signal to noise.
The joint distributions of Bijvoet mates require knowledge of the atomic composition of the crystal and the atomic scattering factors (including the anomalous, or imaginary, contributions), which is generally only known approximately when collecting diffraction data from a crystal. The role of atomic composition in anomalous scattering can be summarized by a complex correlation parameter, FF , which varies smoothly with resolution and can therefore be determined in resolution shells. The joint distribution of observed amplitudes that takes account of the effects of anomalous scattering ( FF ) and measurement error ( correlations in measurement errors between Bijvoet pairs ( AE ), is the basis for a likelihood target that can be optimized in terms of the two types of correlation parameter, FF and AE . Given the atomic composition of the protein component of the crystal, as well as the presumed identity of the primary anomalous scatterer, the variation of FF with resolution can be interpreted in terms of the content of the primary anomalous scatterer (the equivalent number of fully occupied atoms) and the average difference between the B factors of the anomalous scatterers and of other atoms in the crystal. In practice, if different hypotheses about the number of copies of the protein in the asymmetric unit were being tested, the estimated anomalous scatterer content would change proportionally.
The validity of the likelihood target and the deductions that it allows about the anomalous scatterer content were tested by carrying out calculations on our extensive curated database. This demonstrated an excellent correlation between the predicted anomalous scatterer content and the content obtained by refining the known substructures against the same data.
The results presented here demonstrate the accuracy of the new statistical model for the effects of measurement error and atomic composition on the measurement of Bijvoet pairs of reflections. The deduced anomalous scatterer content can inform strategic decisions about whether it is likely that the substructure can be determined, how difficult the problem will be (as it depends strongly on the number of atoms to be found) and how to approach the substructure determination. The success of the statistical approach depends on the quality of the measurement-error estimates; our results imply that these error estimates, at least for data used successfully for SAD phasing, are reasonably accurate.
Work in progress will build on what is presented here, showing that the results of the substructure content analysis can subsequently be used to calculate a number of measures of signal for SAD phasing: the extra information content gained by measuring Bijvoet pairs and expected values for the log-likelihood gain, figures of merit and map correlations that will be achieved in phasing once a substructure has been determined. In the longer term, we plan to implement a new iSAD phasing calculation, which should yield better quality phase information for data with low signal. Error-correlation parameter, AE , as a function of resolution for different total levels of radiation dose. The curves show values obtained from analysing data merging the first 90 images (blue), the first 180 images (orange), the first 270 images (grey) and all 360 images (yellow).

Figure 4
Estimation of the equivalent fully occupied number of primary anomalous scatterers for data deposited as intensities. The horizontal axis is the total anomalous scattering power of the gold-standard substructure (weighted sum of squared occupancies of refined sites) and the vertical axis is the estimated anomalous scattering power. The dashed black line represents a perfect prediction, while the black line shows the least-squares linear fit of the estimates. Each data point is coloured by the map correlation coefficient as shown in the legend. Both axes are plotted on a log 10 scale.

Figure 5
Difference map between data from the first quarter and the last quarter of data collection for PDB entry 3ot2, computed with map coefficients F o,first À F o,last , calc , with phases calculated from the deposited structure. The map is contoured at five times its r.m.s. value; the strongest peak, at residue Glu110 of chain A, has a height of 8.41 times the r.m.s. value. This figure was made with ChimeraX (Goddard et al., 2018).

APPENDIX A Conditional probability of true amplitudes given the observed intensities
The first step in deriving the desired conditional probability distribution is to obtain the joint prior distribution of true normalized amplitudes, starting from the joint distribution of normalized structure factors in (4). A change of variables to amplitudes and phases yields (15) where, for notational simplicity, Z denotes the square of the corresponding E value [e.g. Z À = (E À ) 2 ] and the phase of E ÀÃ is referred to as À .
In (15), the phase of FF has been ignored; if it were included, it would simply add a phase shift to the phase difference and would therefore have no effect on the integral over all phases in the next step to obtain the joint distribution of normalized amplitudes in (16).
A change of random variables from normalized amplitudes gives the prior joint distribution of the normalized intensities in (17) (noting that, for example, dZ À = 2E À dE À ).
Assuming that the measured normalized intensities are related to the true values by the addition of correlated measurement errors drawn from a bivariate normal distribution, the conditional distribution of the observed normalized intensities is given in (5), which is repeated here for convenience.
The joint probability of both pairs of true and observed intensities, pðZ þ O ; Z À O ; Z þ ; Z À Þ, is obtained by multiplying together the expressions in (17) and (5), and the probability distribution of the observed pair of intensities is then obtained by integrating over all possible values of the true intensities, shown in (18).
Finally, Bayes' theorem is used to obtain the conditional probability of the true pair of intensities given the observed pair from the expressions in (5), (17) and (18), and a change of variables gives the probability distribution for normalized amplitudes shown in (19). As above, for notational simplicity, Z is used for the square of the corresponding E value.
In the evaluation of (19) used for numerical tests, the double integral from (18) is carried out analytically in Mathematica (version 12.0; Wolfram Research).

APPENDIX B
Conditional probability of true amplitudes from the iSAD approximation The first step in developing the desired probability distribution is to construct the joint distribution of the true normalized structure factors along with the phased structure factors corresponding to the effective amplitudes. The mathematical form for the relationships among these structure factors is the same as that considered for phasing SAD data when there are calculated structure factors from a substructure model, so the derivations below follow a similar outline to previous work on the SAD likelihood target (McCoy et al., 2004). To define the distribution, we need four new complex covariances (equivalent to complex correlations, because the variables are normalized), given in (20).
The overall joint distribution of these four complex structure factors is a multivariate complex normal distribution (21a) in which the expected values (before any measurements or other information) are all zero and the covariance matrix is given in (21b).
where AE ¼ research papers The basic strategy to obtain the desired conditional distribution is to change variables to amplitudes and phases, integrate over all the unknown phases to obtain a joint distribution of the amplitudes, and then apply Bayes' theorem. One route to this result is given in (22). pðE þ ; E À ; E þ e ; E À e Þ ¼ RRR 2 0 pðE þ ;E À ; À ;E þ e ; þ e ;E À e ; À e ÞpðE À ; À ;E þ e ; þ e ;E À e ; À e Þ pðE þ e ;E À e Þ d À d þ e d À e : ð22Þ In the triple integral, all of the phase terms contain phase differences so, if one phase is fixed at an arbitrary value, the others will vary over all possible values relative to each other and to the fixed phase. For this reason, one of the phase integrals can be omitted and the remaining double integral can simply be multiplied by 2, as shown in (23) For a similar reason, the phases of FF and FF,obs are ignored because they would just add a constant offset to the phase differences. The double integral is carried out analytically in Mathematica for the numerical tests. The three probability distributions needed for (23) are provided below.
The joint probability distribution for three phased structure factors, pðE À ; À ; E þ e ; þ e ; E À e ; À e Þ, is obtained by analogy to (21), but omitting E + as well as the first row and column of the covariance matrix and then changing the complex variables to amplitude and phase variables.
The probability of the amplitude of E + given the other three phased structure factors is obtained by first partitioning the covariance matrix from (21) to obtain the conditional distribution of E + in (24). pðE þ ; E ÀÃ ; E þ e ; E ÀÃ e Þ ¼ Next, after a change of variables from the complex E + to its amplitude (E + ) and phase ( + ), the expression in (24) Finally, the prior joint probability distribution of the effective amplitudes, pðE þ e ; E À e Þ, as given in (9)