On the application of the expected log-likelihood gain to decision making in molecular replacement

The expected log-likelihood gain can be used to predict the outcome of molecular replacement and optimize molecular-replacement strategies.


Introduction
Solving the phase problem by molecular replacement is a problem of signal to noise; the signal for the correct placement of the model must be found amongst the noise of incorrect placements. The signal of a placement is indicated by its translation-function Z-score (TFZ), which is the number of standard deviations over the mean (Z-score) for the log-likelihood gain on intensity (LLGI) in the translation function (TF). The most sensitive function for scoring the placements is a maximum-likelihood function based on the Rice distribution (LLGI). For a single acentric reflection,

$$p(E_e; E_C) = \frac{2E_e}{1 - D_\mathrm{obs}^2\sigma_A^2} \exp\left(-\frac{E_e^2 + D_\mathrm{obs}^2\sigma_A^2 E_C^2}{1 - D_\mathrm{obs}^2\sigma_A^2}\right) I_0\left(\frac{2\sigma_A D_\mathrm{obs} E_e E_C}{1 - D_\mathrm{obs}^2\sigma_A^2}\right),$$

where $E_C$ is the normalized structure-factor amplitude calculated from the placed model, $\sigma_A$ is the fraction of the calculated structure factor that is correlated with the observed structure factor, and $E_e$ (the 'effective E') and $D_\mathrm{obs}$ are derived nontrivially from the observed intensity and its standard deviation ($I_\mathrm{obs}$ and $\sigma_{I_\mathrm{obs}}$, respectively) as described in detail in Read & McCoy (2016). The LLGI has a significantly higher signal to noise for molecular replacement than the amplitude-based LLG target (Bricogne & Irwin, 1996; Murshudov et al., 1997). The increase is particularly significant when the data are anisotropic and/or strongly modulated owing to the presence of noncrystallographic symmetry, when low-intensity reflections are important for the analysis but reflections with insignificant signal to noise cannot be removed with a simple resolution truncation. The LLGI also allows data beyond the traditional resolution limits to be included in the likelihood calculation, so that all data collected with significant signal to noise, regardless of resolution, can contribute to the signal.
The LLGI required to be confident in a solution for the placement of the first model in molecular replacement depends on the number of parameters that have to be fixed. The results from a database of over 22 000 molecular-replacement calculations, each placing a single model in the asymmetric unit, show that for nonpolar space groups (where the solution has six degrees of freedom) most solutions with an LLGI of 60 or greater are correct, whereas an LLGI of 50 is sufficient for polar space groups and an LLGI of 30 is sufficient for space group P1, i.e. an LLGI ten times the number of degrees of freedom is sufficient to be confident of success. For reference, we call these space-group-dependent LLGI values the solved-LLG values. LLGI values lower than the solved-LLG give proportionately lower confidence in a solution (see Fig. 1 in McCoy et al., 2017).
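The rule of thumb above can be written as a small lookup. This is only a sketch: the function and dictionary names are hypothetical, and the degrees of freedom for polar space groups (five) and P1 (three) are inferred here from the quoted LLGI values of 50 and 30 rather than stated explicitly in the text.

```python
# Degrees of freedom fixed when placing the first model.
# Six for nonpolar space groups is stated in the text; five (polar) and
# three (P1) are inferred from the quoted solved-LLG values of 50 and 30.
DEGREES_OF_FREEDOM = {"nonpolar": 6, "polar": 5, "P1": 3}


def solved_llg(space_group_class: str) -> int:
    """Return the solved-LLG: ten times the number of degrees of freedom."""
    return 10 * DEGREES_OF_FREEDOM[space_group_class]


for name in DEGREES_OF_FREEDOM:
    print(name, solved_llg(name))
```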
Since the value of the LLGI is directly related to the outcome of molecular replacement, the expected value of the LLGI for a correctly placed model for any given molecular-replacement problem will predict the outcome. Following McCoy et al. (2017), the expected value of the LLGI per reflection is a probability-weighted integral over the two unknown parameters $E_e$ and $E_C$ of the LLGI, which may be approximated as

$$\langle \mathrm{LLGI} \rangle_{hkl} \simeq \frac{(D_\mathrm{obs}\sigma_A)^4}{2}.$$

The approximation is particularly good for the low values of $D_\mathrm{obs}\sigma_A$ that characterize the cases of most interest, when the signal to noise in the molecular-replacement search is low. The eLLG is the sum for all reflections,

$$\mathrm{eLLG} = \sum_{hkl} \langle \mathrm{LLGI} \rangle_{hkl} \simeq \sum_{hkl} \frac{(D_\mathrm{obs}\sigma_A)^4}{2}.$$

Again following McCoy et al. (2017), the variance of the contribution of one reflection to the eLLG is

$$\sigma^2_{\mathrm{LLGI},hkl} = \left\langle (\mathrm{LLGI} - \langle \mathrm{LLGI} \rangle)^2 \right\rangle_{hkl} \simeq (D_\mathrm{obs}\sigma_A)^4 \simeq 2\langle \mathrm{LLGI} \rangle_{hkl}.$$

Numerical integrations show that the eLLG for a randomly (incorrectly) placed structure is approximately $-$eLLG for a correctly placed structure, also with a variance of approximately twice the eLLG. The TFZ for a correct placement is therefore proportional to $\mathrm{eLLG}^{1/2}$. This reasoning is consistent with the results of database studies (Oeffner et al., 2013; McCoy et al., 2017), where a correct solution is equivalently indicated by a TFZ of 8 and an LLGI of $8^2$ ($\sim$60) in nonpolar space groups, a TFZ value that has long been associated with indicating a correct solution (Table 1; McCoy et al., 2009), and a TFZ of 7 and an LLGI of $7^2$ ($\sim$50) in polar space groups.
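As an illustration of the relations above, the following sketch sums the approximate per-reflection contributions, taking the variance per reflection as $(D_\mathrm{obs}\sigma_A)^4$ and the expected LLGI as half of that. The function names and the toy input values are hypothetical; this is not the Phaser code, and real $D_\mathrm{obs}$ and $\sigma_A$ values vary from reflection to reflection.

```python
import math


def ellg(d_obs, sigma_a):
    """Approximate eLLG: sum over reflections of (D_obs * sigma_A)^4 / 2."""
    return sum((d * s) ** 4 / 2.0 for d, s in zip(d_obs, sigma_a))


def ellg_variance(d_obs, sigma_a):
    """Approximate variance: sum of (D_obs * sigma_A)^4, i.e. twice the eLLG."""
    return sum((d * s) ** 4 for d, s in zip(d_obs, sigma_a))


def approx_tfz(llg):
    """Empirical relation quoted in the text: TFZ is roughly LLGI^(1/2)."""
    return math.sqrt(max(llg, 0.0))


# Toy data: 1000 reflections, perfect data (D_obs = 1), sigma_A = 0.5.
d_obs = [1.0] * 1000
sigma_a = [0.5] * 1000
print(ellg(d_obs, sigma_a), ellg_variance(d_obs, sigma_a))
```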
To calculate the eLLG it is necessary to estimate $\sigma_A$. The resolution-dependent estimates of $\sigma_A$ depend on both the expected coordinate error ($\Delta_m$) and the expected fraction scattering ($f_m$) of the model. A $\Delta_m$ for proteins can be calculated from the sequence identity between the model and the target and the number of residues in the target (Oeffner et al., 2013), or inferred from other priors. $f_m$ is deduced by comparing the scattering matter in the model with the expected (ordered) contents of the asymmetric unit. The $\sigma_A$ estimation for eLLG calculation in Phaser is given by equation (6). The dependence of $\sigma_A$ on the solvent term in square brackets in (6) is the square of the solvent term previously described (Read, 2001; McCoy et al., 2017), after studies indicated better $\sigma_A$ estimation using this functional form (data not shown).
These relationships between $f_m$, $\Delta_m$, the number of reflections and the eLLG give fresh insights into molecular replacement. Previously, we showed that the eLLG predicted the success of single-atom molecular replacement, which was borne out in the solution of the 1.39 Å resolution structure of residues 22-95 of Shisa3 (McCoy et al., 2017). We here show how the eLLG can be used more generally to optimize molecular-replacement strategies. Most obviously, the eLLG can be used to predict the outcome of molecular replacement with a model or set of models. We also discuss here the application to decisions regarding minimal data requirements, the burgeoning field of fragment-based molecular replacement, and likelihood-guided model pruning.

Phaser implementation
The applications discussed below are implemented from … (Abrahams & Grosse-Kunstleve, 2003). Details of the implementation of each eLLG-based functionality described in the sections below are given in the relevant section.

Table 1
Guidance for the outcome of molecular replacement in Phaser for the placement of the first model in nonpolar space groups, showing the relationship between the translation-function Z-score TFZ and the LLGI ($\mathrm{TFZ} \simeq \mathrm{LLGI}^{1/2}$).

Can I solve my structure by molecular replacement?
If the eLLG for placing a model in the asymmetric unit is well over the solved-LLG then structure solution is likely to be straightforward: high signal to noise and an unambiguous solution.
If the eLLG for placing a model in the asymmetric unit is approaching the solved-LLG then the solution will not distinguish itself clearly from noise. Molecular replacement with Phaser will generate a list of potential solutions rather than a single (correct) solution. The number of potential solutions will increase as the signal from the molecular replacement decreases. There is a sigmoidal relationship between the LLGI and the chance of a solution being correct; half of the solutions with an LLGI equal to half of the solved-LLG are correct (Oeffner et al., 2013; McCoy et al., 2017). The solution list is likely to contain the correct solution (an enriched list), even though molecular replacement is not conclusive. It may be possible to distinguish the correct molecular-replacement solution in an enriched list by taking each potential solution through to refinement, particularly wide-convergence-radius refinement as implemented in REFMAC jelly-body refinement (Murshudov et al., 2011), phenix.mr_rosetta (DiMaio et al., 2013) or phenix.den_refine (Schröder et al., 2010).
When macromolecular entities in the asymmetric unit are represented by separate models, the molecular-replacement solution is built up by sequential addition; the eLLG can be used to predict the success of each step of molecular replacement. The molecular-replacement signal is predicted to be clear when the increase in the eLLG for the placement of a model (not necessarily representing the complete asymmetric unit contents) is over the solved-LLG. Note that the eLLG does not increase linearly as copies of a model are added. Rather, the eLLG increases in proportion to $f_m^2$; adding a second copy of a model increases the eLLG to four times that of the first alone, so that, for example, the eLLG for a single copy of a model need only be 20 for the eLLG for two copies to be 80, yielding a change of 60 and corresponding to a potentially clear solution.
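The quadratic growth described above can be checked with a few lines of arithmetic. This sketch assumes identical copies (the same coordinate error for each, and a scattering fraction that grows linearly with the number of copies); the function names are hypothetical.

```python
def ellg_for_copies(ellg_one_copy: float, n_copies: int) -> float:
    """eLLG scales with f_m^2, so n identical copies give n^2 times one copy."""
    return ellg_one_copy * n_copies ** 2


def ellg_gain_for_copy(ellg_one_copy: float, k: int) -> float:
    """Increase in the eLLG when the k-th copy is added to k - 1 placed copies."""
    return ellg_for_copies(ellg_one_copy, k) - ellg_for_copies(ellg_one_copy, k - 1)


# Worked example from the text: one copy at 20 gives 80 for two copies,
# a gain of 60 on placing the second copy.
print(ellg_for_copies(20.0, 2), ellg_gain_for_copy(20.0, 2))
```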

Implementation
Phaser lists the eLLG for the placement of the first copy of each search model. If models have already been placed in the asymmetric unit then the eLLG for the addition of another copy of each search model is reported.

Example using ARCIMBOLDO_LITE
The crystal structure of the carboxy-terminal domain of human translation initiation factor Eif5 (PDB entry 2iu1) in space group $P2_12_12_1$ contains 179 amino acids with 11 helical segments of lengths ranging from seven to 21 amino acids. Diffraction data to 1.7 Å resolution are available (Bieniossek et al., 2006). In ARCIMBOLDO_LITE, two polyalanine helices 14 amino acids in length are sufficient to phase the data after molecular replacement and density modification interspersed with autotracing with SHELXE (Usón & Sheldrick, 2018). Assuming $\Delta_m$ = 0.2 Å, which is an appropriate value for a 14-residue helix in the context of ARCIMBOLDO, the eLLG is 12 for the placement of the first 14-amino-acid helix and increases to 48 upon correct placement of the second helix. In practice, LLGI values of 27 and 89 are obtained, associated with TFZ scores of 5.7 and 9.7 (cf. $\mathrm{TFZ} \simeq \mathrm{LLGI}^{1/2}$).
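The numbers quoted in this example can be compared with the two rules of thumb used in the text: the fourfold increase in the eLLG for a second identical helix, and the approximate relation between TFZ and the square root of the LLGI. The variable names below are hypothetical and the script only repeats the arithmetic.

```python
import math

# eLLG for the first helix, as quoted in the text.
ellg_first = 12.0
# Two identical helices double f_m, so the eLLG is expected to quadruple.
ellg_two = ellg_first * 2 ** 2

# (observed LLGI, observed TFZ) pairs quoted in the text.
observed = [(27.0, 5.7), (89.0, 9.7)]
# Append sqrt(LLGI), rounded to one decimal, for comparison with the TFZ.
comparison = [(llgi, tfz, round(math.sqrt(llgi), 1)) for llgi, tfz in observed]

print(ellg_two)
for llgi, tfz, root in comparison:
    print(f"LLGI {llgi:.0f}: observed TFZ {tfz}, sqrt(LLGI) {root}")
```

The square roots (about 5.2 and 9.4) fall slightly below the observed TFZ values, which is in line with the relation being approximate.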

Search strategies
The eLLG calculation accounts for the trade-off between $f_m$ and $\Delta_m$, in which small accurate models may give a higher eLLG than larger more inaccurate models. Searching for models in the order of decreasing eLLG should optimize the path to structure solution.
When there is more than one model to be placed in the asymmetric unit, search strategies benefit from knowing how many models need to be placed before a clear signal is expected, because if molecular replacement is failing then the search for many copies becomes highly branched and very slow. Using a database of 8762 two-component (heterodimeric) molecular-replacement trials, a clear signal for a correct molecular-replacement solution was found when the gain in the LLGI with the placement of the second component was the solved-LLG (Fig. 1).
Using the eLLG, molecular replacement can be initiated searching for the number of models for which placement of the last copy should increase the LLGI by the solved-LLG. If the increase in the LLGI reaches the solved-LLG then finding the remaining copies should be straightforward. If the LLGI does not reach the eLLG as expected, further (likely unproductive) search branching is curtailed. If more than one model is available for the target structure then alternative models can be rapidly screened without having to attempt complete structure solution with each.
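One way to picture this strategy is to ask how many identical copies must be placed before the gain from the last copy reaches the solved-LLG. The sketch below assumes identical copies and the quadratic dependence of the eLLG on the scattering fraction, so that the gain for the k-th copy is (2k − 1) times the eLLG of one copy; the function name is hypothetical and this is not the Phaser search logic.

```python
def copies_until_clear_signal(ellg_one_copy, solved_llg, max_copies):
    """Return the smallest k for which placing the k-th copy is expected to
    raise the eLLG by at least the solved-LLG, or None if no k <= max_copies
    achieves this."""
    for k in range(1, max_copies + 1):
        # eLLG(k) - eLLG(k - 1) with eLLG(k) = k^2 * eLLG(1)
        gain = ellg_one_copy * (k ** 2 - (k - 1) ** 2)
        if gain >= solved_llg:
            return k
    return None


# With an eLLG of 20 per copy the second copy already gives a gain of 60.
print(copies_until_clear_signal(20.0, 60.0, max_copies=4))
```

A result of None suggests that a clear signal is not expected for any search depth up to the number of copies in the asymmetric unit, so a deep branched search is unlikely to be productive.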

Implementation
The default search order for the placement of multiple components in the asymmetric unit is by decreasing order of eLLG. However, if the search for the model with the highest eLLG does not yield a definite solution (implemented in Phaser as TFZ > 8) then the search for the first placement is repeated with models for other components of the asymmetric unit until a definite solution is found. If none of the components can be found with a definite solution then molecular replacement continues by building upon the placement of the highest LLGI-scoring first component.
Phaser calculates the eLLG for the addition of each model to the current contents of the asymmetric unit during a multicomponent molecular-replacement search.
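The search order and fallback described above can be sketched as follows. The callback `run_search` is a hypothetical stand-in for a full rotation and translation search that returns a (TFZ, LLGI) pair; none of these names correspond to the Phaser interface.

```python
def order_search_models(ellg_by_model):
    """Return model names sorted by decreasing eLLG."""
    return sorted(ellg_by_model, key=ellg_by_model.get, reverse=True)


def first_placement(ellg_by_model, run_search, tfz_cutoff=8.0):
    """Try models in order of decreasing eLLG until one gives TFZ > cutoff.

    Returns (model_name, definite). If no model gives a definite solution,
    the model whose placement scored the highest LLGI is returned with
    definite=False, so that the search can build on it.
    """
    llgi_by_model = {}
    for name in order_search_models(ellg_by_model):
        tfz, llgi = run_search(name)  # hypothetical search callback
        if tfz > tfz_cutoff:
            return name, True
        llgi_by_model[name] = llgi
    best = max(llgi_by_model, key=llgi_by_model.get)
    return best, False


# Toy usage with made-up search results.
ellgs = {"A": 30.0, "B": 90.0, "C": 55.0}
results = {"A": (9.1, 70.0), "B": (5.0, 20.0), "C": (6.0, 35.0)}
print(first_placement(ellgs, lambda name: results[name]))
```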

Example
A mutant form of the four-helix-bundle protein ROP1 was originally solved by an extensive Monte Carlo search for four separate helices (Glykos & Kokkinidis, 2003). The eLLG values for one, two, three and four helices are shown in Table 2, and indeed the structure solution becomes straightforward after the placement of the third helix, where the increase in the eLLG is 84.

Resolution
At low resolution, where $\sigma_A$ is low owing to errors in modelling solvent and there are fewer reflections in each resolution shell, the eLLG rises slowly as the resolution of the data increases (Fig. 2). At resolutions where $d \gg \Delta_m$ each reflection contributes a similar amount to the eLLG, which therefore rises more rapidly with increasing $d^*$ (Fig. 2). At higher resolutions, the contribution to the eLLG from each reflection again drops, and reflections added at resolutions $d < 1.8 \times \Delta_m$ do not increase the eLLG significantly (Fig. 2). An effective eLLG limit is reached asymptotically, with the limit reached in any given case determined by the estimated $\Delta_m$. This is as expected: the structure-factor contributions from the model are almost uncorrelated with those from the true structure when the Bragg spacing is much less than $\Delta_m$. For reference, $1.8 \times \Delta_m$ is called the $\Delta_m$-limited resolution.
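The falloff at high resolution can be illustrated with a simplified model of the coordinate-error term. The attenuation used below, $\sigma_A \propto \exp[-2\pi^2\Delta_m^2/(3d^2)]$, is the standard result for Gaussian-distributed coordinate errors and is an assumption made for this sketch; it omits the solvent term and is not the expression used in Phaser. The function names are hypothetical.

```python
import math


def delta_m_limited_resolution(delta_m: float, factor: float = 1.8) -> float:
    """Resolution (in Å) beyond which reflections add little to the eLLG."""
    return factor * delta_m


def relative_contribution(d: float, delta_m: float) -> float:
    """Per-reflection eLLG contribution relative to its zero-error value.

    ASSUMPTION: sigma_A falls off as exp(-2 pi^2 delta_m^2 / (3 d^2));
    the contribution to the eLLG scales as sigma_A^4.
    """
    attenuation = math.exp(-2.0 * math.pi ** 2 * delta_m ** 2 / (3.0 * d ** 2))
    return attenuation ** 4


delta_m = 1.0  # Å
for d in (10.0, 5.0, 3.0, delta_m_limited_resolution(delta_m)):
    print(f"d = {d:4.1f} Å  relative contribution = {relative_contribution(d, delta_m):.4f}")
```

Under this assumed form, a reflection at the $\Delta_m$-limited resolution contributes well under 0.1% of what it would with an error-free model, which is consistent with the plateau described above.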
If the data resolution is less than that required to reach the solved-LLG and less than the $\Delta_m$-limited resolution with any of the available models, molecular replacement is likely to be unsuccessful and therefore should not be pursued at length. The efforts of the crystallographer will be more usefully deployed exploring data-optimization strategies (see, for example, Heras & Martin, 2005; Alcorn & Juers, 2010).
Conversely, the eLLG calculated using all of the data may exceed the solved-LLG, in some cases by orders of magnitude. If this is the case then the resolution of the data used for molecular replacement can be cut substantially without jeopardizing a successful outcome. Since the time taken to calculate the LLGI is proportional to the number of reflections, reducing the number of reflections increases the speed of molecular replacement very significantly.
However, in cases where the coordinate error is higher than expected and/or the fraction of the scattering is lower than expected then the LLGI obtained will be lower than the eLLG. If the data do not reach the $\Delta_m$-limited resolution, truncation of the data using the eLLG will be too severe, leaving too few reflections for successful molecular replacement; molecular replacement must then be repeated with more (all) data included, making the total time for molecular replacement greater than if more (all) data had been used from the outset. The eLLG used to determine the resolution for data truncation is called the target-eLLG. Rather than using the solved-LLG as the target-eLLG for data truncation, higher target-eLLG values can be used (which give a higher resolution for data truncation than the solved-LLG). To optimize the target-eLLG for the total time to solution, a database of 331 molecular-replacement calculations which did not reach the $\Delta_m$-limited resolution was mined after varying the target-eLLG (Fig. 3). A target-eLLG of 225, corresponding to a TFZ of 15, optimized the average speed. For reference, we call this the optimal-target-eLLG.

Figure 1
Confidence in the molecular-replacement solution for the placement of two components in the asymmetric unit. The increase in the final refined LLGI score ($\Delta\mathrm{LLGI} = \mathrm{LLGI}_2 - \mathrm{LLGI}_1$, where $\mathrm{LLGI}_1$ is for the placement of the first component and $\mathrm{LLGI}_2$ is for the placement of both components) provides a clear diagnostic for success in molecular replacement (8762 trials).

Figure 2
Increase in the eLLG with resolution (orange line) for a model with $\Delta_m$ = 1.0 Å and a data set with 10 000 reflections to 2.0 Å resolution. An eLLG of 64 (greater than the solved-LLG) is achieved at 5.8 Å resolution. A contrasting case (blue line) shows the increase in the eLLG for a model with $\Delta_m$ = 1.7 Å. The eLLG will at best be 40.4 (less than the solved-LLG); however, this value is reached asymptotically and including data with resolution higher than 3.0 Å ($1.8 \times \Delta_m$) will not increase the eLLG significantly.

Table 2
ROP1 (Glykos & Kokkinidis, 2003) solved with a 25-residue polyalanine helix.
The LLGI values achieved in the search follow the eLLG values predicted from a $\Delta_m$ of 0.3 Å, an appropriate value for a helix of this length, and a scattering fraction of 0.14. The TFZ exceeds 8 for the placement of the third helix, when the increase in the LLGI ($\Delta$LLGI) is 84.

Implementation
By default, all analyses based on the eLLG are performed with the target-eLLG set to the optimal-target-eLLG. Lower or higher target-eLLG values can be set for any given analysis, but should be greater than the solved-LLG.
In automated molecular replacement, Phaser limits the resolution of the data to the resolution required to achieve the target-eLLG (optimal-target-eLLG) and does not include data beyond the $\Delta_m$-limited resolution. However, the factor of 1.8 applied to $\Delta_m$ for calculating the $\Delta_m$-limited resolution is decreased to 1.5 for automated molecular replacement, because refinement of the coordinate error may reduce the coordinate error from the expected value ($\Delta_m$). If a definite solution (TFZ > 8) is not obtained then the search is repeated using all data.
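The resolution choice described above can be sketched as a cumulative sum over resolution shells. The input format (a list of shells with precomputed eLLG contributions) and the function names are hypothetical; only the target of 225 and the factor of 1.5 are taken from the text.

```python
def resolution_for_target(shells, target_ellg=225.0):
    """Return the high-resolution limit (Å) at which the cumulative eLLG
    first reaches the target, or None if the data never reach it.

    shells: list of (d_min, ellg_contribution), ordered from the lowest to
    the highest resolution shell (i.e. decreasing d_min).
    """
    total = 0.0
    for d_min, contribution in shells:
        total += contribution
        if total >= target_ellg:
            return d_min
    return None


def automated_resolution_cut(shells, delta_m, target_ellg=225.0, factor=1.5):
    """Resolution limit for a first pass: enough data to reach the target,
    but never beyond factor * delta_m or beyond the data themselves."""
    d_data = shells[-1][0]
    floor = max(d_data, factor * delta_m)
    d_target = resolution_for_target(shells, target_ellg)
    if d_target is None:
        return floor
    return max(d_target, floor)


# Toy shells with made-up eLLG contributions.
shells = [(8.0, 100.0), (6.0, 100.0), (4.0, 100.0), (3.0, 100.0)]
print(automated_resolution_cut(shells, delta_m=1.0))
```

If the search at the truncated resolution does not give a definite solution, the text indicates that the search is simply repeated with all data.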

Example
Ribosome structures crystallize in large unit cells and so have many more reflections to a given resolution than structures crystallizing in smaller cells. The structure of the hybrid state of the ribosome in complex with the guanosine triphosphatase release factor 3 (PDB entry 3zvo) can be solved with the 30S (PDB entry 2j00) and 50S (PDB entry 2j01) components of the structure of the Thermus thermophilus 70S ribosome complexed with mRNA, tRNA and paromomycin (Selmer et al., 2006; Jin et al., 2011). The data extend to 3.6 Å resolution. The coordinate error between the model and the target is predicted to be 0.67 Å (Oeffner et al., 2013). The percentages of the scattering represented by the 50S and 30S subunits are 45 and 27%, respectively, with one ribosome in the asymmetric unit. The eLLGs for the 50S and 30S components reach the target of 225 at resolutions of 9.2 and 8.1 Å, respectively.

Fragment-based molecular replacement
Fragment-based molecular replacement for proteins has its origins in the solution of helical proteins by placing short polyalanine helices (Glykos & Kokkinidis, 2003; Rodríguez et al., 2009). A similar method was developed for RNA, using canonical RNA structure motifs to build full solutions (Robertson et al., 2010). Much recent work has focused on the generation of more general structural fragments, including those from distant homologues (ARCIMBOLDO_SHREDDER; Sammito et al., 2014; Millán et al., 2018), libraries of structural motifs (ARCIMBOLDO_BORGES; Sammito et al., 2013) or molecular modelling (AMPLE; Bibby et al., 2012). These methods rely on the generation of small but extremely accurate (low coordinate error) fragments, followed by expansion of the placed fragments using aggressive density-modification and model-building methods, such as those implemented in SHELXE (Sheldrick, 2010).
In fragment-based molecular replacement, the coordinate error is not accurately estimated from sequence identity, and so the eLLG cannot accurately estimate the LLGI. However, the eLLG can answer a different question: 'If the expected coordinate error between my fragment and the structure is a certain value, then what size fragment will I need for successful molecular replacement?' The fragment library should have fragment sizes tailored to the problem at hand, with an appropriate trade-off between $f_m$ and $\Delta_m$ for the data available.
Fragment-based molecular-replacement strategies can be successful even when the eLLG per fragment is much lower than the solved-LLG, and when molecular replacement will only provide an enriched solution list. Strategies to identify the correct solution may include considering the persistence of solutions in solution lists from alternative, but similar, fragments. Key to structure completion in these cases is the application of density-modification, chain-tracing and refinement procedures.

Implementation
Phaser reports the number of polyalanine residues required to reach the target-eLLG (default optimal-target-eLLG) for an input $\Delta_m$ (or set of input $\Delta_m$ values). This number, when calculated in advance of fragment generation, can be used to design bespoke fragment sizes for each molecular-replacement problem.

Figure 3
Average Phaser runtime for structure solution for 331 successful molecular-replacement test cases versus the eLLG used to determine the resolution of the data used for molecular replacement. The optimal-target-eLLG for minimizing the total Phaser runtime was 225.
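A rough feel for the fragment size needed can be obtained by scaling from a reference fragment. This sketch assumes that the coordinate error and the data are unchanged, that the scattering fraction is proportional to the number of polyalanine residues, and that the eLLG scales with the square of the scattering fraction; the function name is hypothetical, and Phaser performs the full eLLG calculation rather than this scaling.

```python
import math


def residues_for_target(n_ref: int, ellg_ref: float, target_ellg: float) -> int:
    """Estimate the fragment size needed to reach a target eLLG.

    ASSUMPTION: eLLG is proportional to (number of residues)^2 at fixed
    coordinate error, so n = n_ref * sqrt(target / eLLG_ref), rounded up.
    """
    return math.ceil(n_ref * math.sqrt(target_ellg / ellg_ref))


# Scaling from a 14-residue helix with an eLLG of 12 (the Eif5 example):
# doubling the size (28 residues) is expected to give an eLLG of 48.
print(residues_for_target(14, 12.0, 48.0))
print(residues_for_target(14, 12.0, 60.0))
```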

Example using ARCIMBOLDO_SHREDDER
The structure of the peptidylarginine deaminase from Porphyromonas gingivalis (PDB entry 4yt9) contains 432 residues. It can be solved with fragments drawn from a putative arginine deiminase from the same organism (PDB entry 1zbr), sharing 19% sequence identity and a $\Delta_m$ of 1.5 Å over a core of 231 Cα atoms. The data in space group $P2_12_12_1$ were obtained from a combination of 16 data sets and extended to 1.5 Å resolution. Aiming to find fragments capable of developing into a full solution, $\Delta_m$ was set to 0.8 Å, so that polyalanine models of 101 residues reached an eLLG of 60. ARCIMBOLDO_SHREDDER prepared spherical fragments of 101 residues from PDB entry 1zbr for molecular replacement, and in the course of the ARCIMBOLDO_SHREDDER process (Millán et al., 2018) placed models are given internal degrees of freedom or undergo likelihood-guided pruning (see below) in order to further reduce the $\Delta_m$ and allow successful density modification and expansion.

Single-atom molecular replacement
A single atom is a perfect partial model ($\Delta_m$ = 0). For such a model, $\sigma_A^2 \propto f_m$ and hence $\mathrm{LLGI} \propto f_m^2$. Molecular replacement with a single atom, when the structure is large and $f_m$ is small, requires many reflections because as the number of ordered atoms in the asymmetric unit increases, the LLGI per reflection decreases ($\propto f_m^2$) faster than the number of reflections increases for a proportional unit-cell volume ($\propto f_m$). More reflections may come from higher resolution data or a larger unit cell with the same number of scattering centres (higher solvent content). Since $f_m$ also depends on the scattering curve, atoms of the same element type but with lower B factors will be found with a higher LLGI than those with high B factors. Also affecting the scattering factor are the form factors; with regard to protein, S atoms scatter proportionately more at higher resolution than C, N and O atoms. This effect, however, can be negated by a B factor raised by as little as 2 Å$^2$ above the Wilson B factor (Wilson, 1942). Se atoms in selenomethionine-incorporated proteins are poorer targets for single-atom molecular replacement than their atomic number suggests (Z = 34), since selenomethionine residues often display high mobility or disorder (Dauter & Dauter, 1999).
Single-atom molecular replacement for proteins will be most likely to succeed when the data extend to high resolution, when there is high solvent content and when an S (or heavier) atom is present with a B factor lower than the Wilson B factor. Direct methods also require high-resolution data (resolved atoms). However, single-atom molecular replacement differs from direct methods in that it does not assume equal atoms, and the likelihood basis for the LLGI inherently takes account of the quality of the available data and the nature of the model. The LLGI for single atoms can reach into double digits in favourable cases. Because of the quadratic dependency of the LLGI on f m , the placement of as few as two or three single atoms may give an unambiguous substructure. Structure solution can be completed with peak picking from log-likelihood-gradient maps (McCoy et al., 2017).

Implementation
For single-atom models, Phaser lists the eLLG for the requested search atom type, taking account of the form factors of the atom type relative to the average scattering from protein or nucleic acid, depending on the composition entered. The eLLG is reported for a range of B factors downwards from the Wilson B factor in steps of 0.5 Å$^2$ until the optimal-target-eLLG is reached. This indicates the enrichment that is likely to be obtained by the placement of a first atom that is slightly more ordered than the average atom, and hence how many atoms need to be placed to reach the optimal-target-eLLG.

Example
The N-terminal domain of mouse Shisa3 (PDB entry 5m0w) can be solved by single-atom molecular replacement. S atoms are the heaviest atoms in the structure, and the eLLG values for S atoms that are more ordered than the Wilson B factor are shown in Fig. 4. The eLLG is 5 for S atoms with a relative Wilson B factor of just $-2$ Å$^2$. Seven S atoms were identified by molecular replacement with Phaser. Log-likelihood-gradient completion in Phaser succeeded in expanding the Shisa3 structure to a total of 56 atoms, mostly well ordered main-chain O and N atoms. The resulting phases were suitable for structure completion through density modification and model building.

Likelihood-guided pruning
Editing of structures from the Protein Data Bank prior to molecular replacement is a well established method for improving the signal, and often makes the difference between success and failure (Schwarzenbacher et al., 2004; … Read, 2011; Bunkóczi et al., 2015). Editing methods range from simple truncation of side chains in the model (polyalanine or polyserine), through the selected removal of atoms based on side-chain substitution, removal of loops and altering B factors, to full molecular modelling. At the end of molecular replacement, model editing usually occurs as one of the first steps in structure refinement.

Figure 4
Single-atom molecular replacement for Shisa3 (PDB entry 5m0w) (McCoy et al., 2017). The eLLG for a single S atom depends on how well ordered it is, as measured by the difference between its B factor and the Wilson B factor.
Refinement of atomic occupancies when the phase error is high is not a traditional step during molecular replacement because of the danger of overfitting. The eLLG provides a metric for avoiding overfitting; overfitting is avoided by refining the occupancy of blocks of n residues, with n determined by the number of residues that give a significant change in the eLLG, i.e. the occupancies of n residues are constrained to be the same during occupancy refinement. Note that the reduction in the eLLG ($\Delta$eLLG) owing to the removal of n residues from a model, where n is a small fraction of the total number of residues, is much greater than the eLLG of the placement of the first n residues in the asymmetric unit because of the quadratic dependency of the eLLG on the model size. This likelihood-guided pruning is possible for low-resolution data and/or very incomplete models, even when atomic occupancy refinement would not be justified by the data. This includes cases where not all components of the asymmetric unit have (yet) been placed; where multiple copies of a model are present, pruned models can be used as models for the placement of other copies.
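The asymmetry noted above, and the choice of the window size n, can be illustrated with the quadratic dependence of the eLLG on model size. This sketch assumes that every residue contributes equally to the scattering and that the coordinate error is unchanged by the removal; the function names and the target value in the example are hypothetical.

```python
def delta_ellg_on_removal(ellg_full: float, n_removed: int, n_total: int) -> float:
    """Reduction in the eLLG when n_removed of n_total residues are deleted,
    assuming eLLG is proportional to (fraction of residues kept)^2."""
    kept = 1.0 - n_removed / n_total
    return ellg_full * (1.0 - kept ** 2)


def ellg_of_first_residues(ellg_full: float, n: int, n_total: int) -> float:
    """eLLG expected for placing only n residues out of n_total."""
    return ellg_full * (n / n_total) ** 2


def window_size(ellg_full: float, n_total: int, target_delta_ellg: float):
    """Smallest n whose removal changes the eLLG by at least the target."""
    for n in range(1, n_total + 1):
        if delta_ellg_on_removal(ellg_full, n, n_total) >= target_delta_ellg:
            return n
    return None


# Toy case: a 100-residue model with an eLLG of 100.
print(delta_ellg_on_removal(100.0, 5, 100))   # loss on removing 5 residues
print(ellg_of_first_residues(100.0, 5, 100))  # eLLG of 5 residues alone
print(window_size(100.0, 100, 5.0))           # hypothetical target of 5
```

In this toy case removing five residues costs far more eLLG than five residues would contribute on their own, which is the point made in the text.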
The careful parameterization of likelihood-guided pruning can be compared with B-factor refinement, which must also be carefully parameterized to account for the amount of data present (Merritt, 2012). Strategies to constrain B-factor refinement include group B-factor refinement and TLS refinement (Merritt, 2012), and are usually chosen heuristically. In likelihood-guided pruning there are no heuristics: the parameterization of the occupancies is directly determined by the data.
Likelihood-guided pruning has two applications. Firstly, the use of likelihood-guided pruning during molecular replacement can relieve packing clashes when the models contain atoms that are outside the true molecular envelope. Secondly, the use of likelihood-guided pruning after molecular replacement will accelerate model building and refinement because the process is started from a better model and a better-phased electron-density map.
Likelihood-guided pruning removes atoms that are positioned in solvent regions of the crystal, highly disordered regions of a crystal or regions where the local coordinate error is high. The chemical bonding of atoms is not considered during pruning. Where atoms accurately fill a volume in the crystal pruning will not remove these atoms, even if the placed model does not have the correct atomic types or bonding. This may include cases where the model partly overlies a target and partly overlies a symmetry-related copy of the same target, or partly overlies a different target. Where there is a packing clash between placed models, and more atoms filling a small volume of the asymmetric unit than chemically possible, then likelihood-guided pruning will remove atoms solely on the basis of which ones more accurately represent the true positions of the atoms. It is thus possible that during likelihood-guided pruning the 'wrong' residues are removed, where 'wrong' can only be defined in the context of a priori information that is not available to the pruning algorithm, such as sequence differences between model and target or the likely disorder of residues. Note that similar reasoning could be employed in parameterizing model building and structure refinement more generally.
The change in the eLLG for determining n (the target-$\Delta$eLLG) was found by probing a database of 8966 molecular-replacement calculations (Oeffner et al., 2013) for the minimal $\Delta$eLLG that improved the electron-density map without overfitting the data (Fig. 5). Occupancy refinement was performed in Phaser with n = 1. The purpose of taking n = 1 for the window size was to generate a range of $\Delta$eLLG for the analysis, not to test whether or not n = 1 was the appropriate window size; since the model $\Delta_m$ and the per-residue $f_m$ were different for each model and target combination, the $\Delta$eLLG was also different for the removal of single residues in each test case. Real-space correlation coefficients (RSCCs) were calculated with respect to the electron density calculated with phases from the refined structure deposited in the PDB (the 'true' map), which were assumed to have low phase error. Then,

$$\Delta\mathrm{RSCC} = \mathrm{RSCC}_\mathrm{pruned} - \mathrm{RSCC}_\mathrm{unpruned}, \qquad (7)$$

where $\mathrm{RSCC}_\mathrm{pruned}$ is the RSCC between the 'true' map and the electron density calculated with phases from the pruned model and $\mathrm{RSCC}_\mathrm{unpruned}$ is the RSCC between the 'true' map and the electron density calculated with phases from the unpruned model. Where $\Delta$RSCC was negative, overfitting was indicated. The mean ($\langle\Delta\mathrm{RSCC}\rangle$) and standard deviation ($\sigma_{\Delta\mathrm{RSCC}}$) of the distribution of $\Delta$RSCC were calculated in narrow windows of $\Delta$eLLG (Fig. 5). As expected, $\langle\Delta\mathrm{RSCC}\rangle$ increased with increasing values of $\Delta$eLLG, and …

Figure 5
$\Delta$RSCC (7) for 8966 successful molecular-replacement test cases consisting of 1526 targets and between one and 33 models per target (with an average of six models per target). The mean and standard deviation of the distribution of $\Delta$RSCC were calculated in narrow windows of $\Delta$eLLG. The mean (orange line) and one standard deviation either side of the mean (yellow lines) are indicated.
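The windowed statistics described above can be sketched as a simple binning of (ΔeLLG, ΔRSCC) pairs. Here ΔRSCC is taken as the RSCC of the pruned model minus that of the unpruned model, which matches the statement that negative values indicate overfitting. The bin width, the record format and the function names are hypothetical, and the data in the example are made up.

```python
from statistics import mean, pstdev


def delta_rscc(rscc_pruned: float, rscc_unpruned: float) -> float:
    """Change in real-space correlation on pruning (negative = overfitting)."""
    return rscc_pruned - rscc_unpruned


def windowed_stats(records, width):
    """Group (delta_ellg, delta_rscc) records into windows of delta_ellg.

    Returns {window_start: (mean, population standard deviation, count)},
    with windows listed in increasing order of delta_ellg.
    """
    bins = {}
    for d_ellg, d_rscc in records:
        start = (d_ellg // width) * width
        bins.setdefault(start, []).append(d_rscc)
    return {start: (mean(values), pstdev(values), len(values))
            for start, values in sorted(bins.items())}


# Made-up records for illustration only.
records = [(0.5, -0.02), (0.7, 0.02), (1.2, 0.04), (1.8, 0.06)]
for start, (avg, sd, count) in windowed_stats(records, width=1.0).items():
    print(f"window from {start}: mean {avg:+.3f}, sd {sd:.3f}, n = {count}")
```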

Implementation
Likelihood-guided pruning is currently implemented for protein chains only. When the model is an ensemble of two or more proteins, pruning is performed on the single best model (i.e. the model with the lowest $\Delta_m$). The number of residues n to remove to obtain the target-$\Delta$eLLG (by default, the minimal-target-$\Delta$eLLG) is determined. Occupancies are refined in windows of n residues for each offset of the window along the protein chain (incremented by single residues). The occupancies of equivalent residues under NCS are not constrained to be the same, because differences in the refined occupancies between NCS copies are valid indicators of differences in crystal packing. The results for each offset of the window are combined by averaging the per-residue occupancy for each offset. This gives the occupancy-refined structure with per-residue occupancies in the range (0.01, 1). The occupancy-refined structure is then converted to a pruned structure, where the occupancies take binary values 0/1 (0 being residues that are pruned) by the application of an occupancy threshold above which the refined occupancies are set to 1 and below which they are set to 0. The optimal threshold is selected by testing thresholds and calculating the LLGI for the model pruned at each value, choosing the threshold generating the highest LLGI. Two coordinate files are output: the pruned structure with occupancies 0/1 and the occupancy-refined structure with occupancies in the range (0.01, 1). The former is ideal for taking forward into model building and refinement, since these expect models with all atoms having occupancy 1. Electron density calculated from the latter may give electron-density maps with lower phase error than those calculated from the former.
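The combination and thresholding steps described above can be sketched as follows. The input is taken to be one list of per-residue occupancies for each window offset, and `score` is a hypothetical stand-in for the LLGI of the pruned model; the treatment of an occupancy exactly equal to the threshold is an arbitrary choice in this sketch. None of the names correspond to the Phaser interface.

```python
def combine_offsets(per_offset_occupancies):
    """Average the per-residue occupancies obtained for each window offset."""
    n_offsets = len(per_offset_occupancies)
    return [sum(column) / n_offsets for column in zip(*per_offset_occupancies)]


def prune(occupancies, threshold):
    """Convert refined occupancies to binary values: 1 = keep, 0 = pruned.

    Occupancies strictly above the threshold are kept (tie handling is an
    assumption of this sketch).
    """
    return [1 if occ > threshold else 0 for occ in occupancies]


def best_threshold(occupancies, score, thresholds):
    """Pick the threshold whose pruned model gives the highest score.

    `score` is a hypothetical callback standing in for the LLGI calculation.
    """
    return max(thresholds, key=lambda t: score(prune(occupancies, t)))


# Toy example: four residues, two window offsets, and a made-up score that
# counts agreement with a known answer.
refined = combine_offsets([[1.0, 1.0, 0.01, 0.01], [1.0, 0.5, 0.5, 0.01]])
truth = [1, 1, 0, 0]
toy_score = lambda mask: sum(m == t for m, t in zip(mask, truth))
threshold = best_threshold(refined, toy_score, [0.2, 0.5, 0.8])
print(refined, threshold, prune(refined, threshold))
```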
As implemented in Phaser, the packing test is a pass/fail test based on a pairwise clash score for the trace points [i.e. approximately 1000 points representing all atoms, Cα atoms or a hexagonal grid of points filling the Wang volume (Wang & Janin, 1993), depending on the protein size (McCoy, 2017)]. The trace points for the protein are regenerated after likelihood-guided pruning. Since the regenerated trace points represent the true atomic volume more accurately, solutions with high TFZ that were discarded for failing the packing test (with the incorrect atomic volume) can be rescued. Likelihood-guided pruning is run by default in the automated molecular-replacement mode (MR_AUTO) if the only solution that is obtained has TFZ > 8 but does not pack successfully.

Example
The structure with PDB code 2hh6 (112 residues), a protein from Bacillus halodurans of unknown function, was modelled as part of the seventh Critical Assessment of Techniques for Protein Structure Prediction (CASP7 target T0283). The model T0283TS020_2 and target 2hh6 differ significantly at several places (Fig. 6a). At the N-terminus, the first 27 residues of 2hh6 form a continuous helix that starts beyond the body of the protein, but in T0283TS020_2 the first two turns of this helix are modelled as a short helix folded back against the body of the protein. At the C-terminus, the last 22 residues of 2hh6 form a loop followed by a three-turn helix, but in T0283TS020_2 these residues are modelled as a shorter loop followed by a five-turn helix which do not overlie 2hh6, and indeed run in the opposite direction away from the true structure (Fig. 6a). Five residues of T0283TS020_2 represent 4.5% of the total scattering and give ΔeLLG = 4.9. Pruning based on a window size of five residues removes residues at the N-terminus and C-terminus where the model and target diverge (Fig. 6a). The change in the LLGI owing to the removal of five residues is much more predictive of the model quality along the chain, as judged by the RSCC of the model against the 'true' map (defined in §7), than is the RSCC between the model and the electron density calculated using phases from the unpruned model (the 'model' map; Fig. 6b). The change in the LLGI is therefore a better indicator of model quality along the chain than the RSCC between the model and the model-phased electron density, as is traditionally used.

Figure 6
Likelihood-guided occupancy refinement for PDB entry 2hh6 solved by molecular replacement with CASP7 model T0283TS020_2. (a) The structure 2hh6 (reference copy, ribbon representation, colour ramp from blue to red from the N-terminus to the C-terminus), the N-terminal helix of a symmetry-related copy of 2hh6 (worm representation, colour ramp as for the reference copy) and the result of likelihood-guided occupancy refinement of the placed model showing occupancies per residue ranging from 1 (black) to 0 (purple). The regions where 2hh6 and the model diverge are the regions where the refined occupancies are close to 0 (the model is shown in purple), and conversely where they coincide the refined occupancies are close to 1 (the model is shown in black). The window size for occupancy refinement was five residues, determined by the optimal-target-ΔeLLG. This figure was produced with CCP4mg (McNicholas et al., 2011). (b) The difference between the LLGI for the placed model T0283TS020_2 before and after removing five residues centred on each residue along the chain (blue line). The RSCC per residue is shown between the placed model T0283TS020_2 and the 'true' map (see text; orange line) and between the placed model T0283TS020_2 and the 'model' map (see text; dotted green line). The RSCC of the model to the 'true' map is better predicted by the change in LLGI (orange line versus blue line) than by the RSCC to the 'model' map (orange line versus dotted green line).

Example
A test case using polypeptide α-N-acetylgalactosaminyltransferases shows the use of likelihood-guided pruning to remove packing clashes [target PDB entry 1xhb (Fritz et al., 2004) and model PDB entry 2d7i (Kubota et al., 2006)]. The sequence identity between the model and the target is 45%. The transferase structures consist of two domains, and the hinge angle between the domains differs between the model and target structures. A model was prepared from PDB entry 2d7i using Sculptor (Bunkóczi & Read, 2011). In the default MR_AUTO mode, Phaser finds a solution with high TFZ, but the difference in the hinge angle between the domains manifests itself as a clash in the packing of this solution. After automatic likelihood-guided pruning, the majority of the residues in the smaller domain of 2d7i are removed and the pruned model passes the packing test (Fig. 7).

Twinning
Twinning reduces the LLGI, and so a correction term should, in principle, be applied to the eLLG. The reduction in the eLLG was studied for hemihedral and tetartohedral crystal twinning, which are particular cases of (pseudo)merohedral twinning in which the number of twinned domains is two and four, respectively. The BETA-BLIP structure (Strynadka et al., 1996), which has previously been used as a test case for Phaser (Storoni et al., 2004; McCoy et al., 2005; McCoy, 2007), was used to generate simulated data with different hemihedral twin fractions, and the LLGI was calculated for the structure given the simulated data (Fig. 8a). The relationship between the LLGI and the twin fraction is approximately linear for hemihedral twinning, such that a twin fraction of a half gives an LLGI that is half of that obtained with untwinned data. A higher order twinning test was performed with the structure of human complement factor I (PDB entry 2xrc), which has P1 symmetry and tetartohedral twinning. For perfect tetartohedral twinning, the LLGI was reduced by a factor of four (Fig. 8b).
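The observed reductions can be expressed as a simple scaling, sketched below. This is an illustration of the trend reported above and not a correction applied by Phaser: the linear form LLGI × (1 − twin fraction) is an assumption chosen to reproduce the approximately linear hemihedral behaviour (halving at a twin fraction of 0.5), and dividing by the number of domains reproduces only the two perfect-twinning cases reported (two and four domains). The function names are hypothetical.

```python
def llgi_hemihedral(llgi_untwinned, twin_fraction):
    """Approximate LLGI for hemihedrally twinned data (assumed linear model)."""
    if not 0.0 <= twin_fraction <= 0.5:
        raise ValueError("hemihedral twin fraction must lie in [0, 0.5]")
    return llgi_untwinned * (1.0 - twin_fraction)


def llgi_perfect_twin(llgi_untwinned, n_domains):
    """Approximate LLGI for perfect twinning with equally populated domains."""
    if n_domains < 1:
        raise ValueError("number of twin domains must be at least 1")
    return llgi_untwinned / n_domains


# Illustrative untwinned LLGI of 200 (an arbitrary value).
print(llgi_hemihedral(200.0, 0.25))  # partial hemihedral twinning
print(llgi_hemihedral(200.0, 0.5))   # perfect hemihedral twinning
print(llgi_perfect_twin(200.0, 4))   # perfect tetartohedral twinning
```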

Implementation
Since the presence, order and/or fraction of twinning cannot be determined with certainty in advance of structure solution, the eLLG is not decreased even if twinning is indicated. Indeed, other data pathologies, which are often associated with twinning, may make molecular replacement more difficult than expected. If molecular replacement fails with twinned data, it may be helpful to increase the target-eLLG.

Figure 7
The molecular-replacement solution of PDB entry 1xhb (grey) solved with PDB entry 2d7i (red and blue) after likelihood-guided pruning of the placed 2d7i model, where blue indicates an occupancy of 1 and red indicates an occupancy of 0. The symmetry-related copy of 1xhb that clashes with the model after initial molecular replacement is shown in gold.

Figure 8
The LLGI as a function of the twin fraction for calculated data, showing that the LLGI of the molecular-replacement solution decreases in proportion to the twin fraction. (a) The LLGI as a function of the hemihedral twin fraction for calculated data for the test case of the β-lactamase (BETA)–β-lactamase inhibitor (BLIP) complex (Strynadka et al., 1996). (b) The LLGI as a function of tetartohedral twinning for calculated data for the test case of human complement factor I (Roversi et al., 2012).

Discussion
Experienced users of Phaser may wish to see a solution with LLGI ≫ 64 and TFZ ≫ 8 to increase the certainty that the solution is correct. While an LLGI > 64 and a TFZ > 8 have been proven to be significant, a target-eLLG of 225, equivalent to TFZ = 15, was found to optimize the time to structure solution. It is likely that the preference of the experienced user for LLGI ≫ 64 and TFZ ≫ 8 is partly informed by their experience of the time taken for structure solution, rather than the outcome. To give the user additional information about the certainty of a solution after automated molecular replacement with Phaser, a 'TFZ-equivalent' is calculated, which is the TFZ that would have been obtained if the refined position were found (i.e. located exactly on the search grid) in a translation function performed with the model in the refined orientation, using all data.
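The two pairs of values quoted above (LLGI = 64 with TFZ = 8, and eLLG = 225 with TFZ = 15) are both consistent with a square-root relationship between the log-likelihood gain and the TFZ. The sketch below encodes that relationship as a rule of thumb only; it is an assumption inferred from these two pairs, the function names are hypothetical, and it is not the 'TFZ-equivalent' calculation performed by Phaser.

```python
import math


def approximate_tfz(llg):
    """Rule-of-thumb TFZ for a given (expected) LLG, assuming TFZ ~ sqrt(LLG)."""
    if llg < 0:
        raise ValueError("LLG must be non-negative for this approximation")
    return math.sqrt(llg)


def approximate_llg(tfz):
    """Inverse of approximate_tfz: the LLG corresponding to a given TFZ."""
    return tfz ** 2


for llg in (64, 225):
    print(f"LLG {llg} corresponds to TFZ of about {approximate_tfz(llg):.0f}")
```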
Pathologies in the data that violate the assumptions of the likelihood function have a severe impact on the likelihood estimates. The eLLG will be an accurate estimator of the LLGI when the data are isotropic and follow a Wilson distribution. Data anisotropy (Murshudov et al., 1998) and many forms of translational noncrystallographic symmetry (tNCS) modulation (Sliwiak et al., 2014) can be accounted for. However, when the data contain uncorrected pathologies, the use of the eLLG to lower the resolution for molecular replacement may cause solutions to be missed. Incorrect placements that reach TFZ > 8 with the minimal number of reflections must be avoided, because the Phaser automated search algorithm will take such a placement to be correct and terminate the search.
The order of the tNCS is not used to increase the f_m for the eLLG calculation. By default, Phaser places all of the tNCS-related molecules in one step of the rotation and translation functions. The f_m for a single copy could thus be multiplied by the number of tNCS-related copies in the calculation of the eLLG; because it is not, the eLLG-truncated resolution will be higher than is necessary to achieve the target eLLG in the presence of tNCS. However, errors in the modelling of the tNCS during the rotation and translation functions, particularly when the tNCS relates more than two copies, mean that conservative resolution truncation is prudent.
Poor estimates of σ_A will degrade the accuracy of the eLLG. Estimates of σ_A depend on both Δ_m and f_m. The Δ_m estimated from the sequence identity between the model and the target and the number of residues in the target (Oeffner et al., 2013) has an associated error with a fractional standard deviation of 0.2. In the future, it may be possible to incorporate the uncertainty in the Δ_m estimate into the eLLG estimate. The eLLG analysis also assumes that the B factors of the components are equal to the Wilson B factor. Differences between the two manifest as errors in f_m. Uncertainties in Δ_m and the search B factor may be accounted for by performing a grid search over these estimates rather than relying on a single estimate. Note that the input values of Δ_m and the search B factor are only important until a solution is found and retained in the potential solution list, even with low signal to noise, because the Δ_m and B factor are refined (to optimize the LLGI) at the end of molecular replacement in Phaser.
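A grid of this kind could be set up as sketched below. The spacing of the Δ_m values uses the fractional standard deviation of 0.2 quoted above, but the choice of one standard deviation either side of the estimate, the B-factor offsets and the function name are all assumptions made for illustration; they are not Phaser defaults.

```python
from itertools import product


def search_grid(delta_m, wilson_b, frac_sd=0.2, b_offsets=(-10.0, 0.0, 10.0)):
    """Return (delta_m, B factor) pairs to try in separate searches.

    delta_m   : estimated model error, in angstroms
    wilson_b  : Wilson B factor of the data, in square angstroms
    frac_sd   : fractional standard deviation of the delta_m estimate
    b_offsets : hypothetical offsets applied to the Wilson B factor
    """
    delta_values = [delta_m * (1.0 + k * frac_sd) for k in (-1, 0, 1)]
    b_values = [wilson_b + offset for offset in b_offsets]
    return list(product(delta_values, b_values))


# Illustrative estimate: delta_m of 1.0 angstrom and a Wilson B of 30.
for delta, b_factor in search_grid(1.0, 30.0):
    print(f"delta_m = {delta:.2f}, search B factor = {b_factor:.1f}")
```

Each pair would be used as the starting estimate for a separate search, with the refinement at the end of molecular replacement then optimizing both values for any retained solution.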
The eLLG only provides a metric for the likely success or failure of molecular replacement. It does not provide a metric for whether or not a molecular-replacement solution can be converted into a completed, validated structure suitable for publication and deposition in the PDB. High-resolution data beyond those required for successful molecular replacement will often be required to reduce model bias. It may be possible to develop other likelihood-based metrics for determining the limits on the structure quality possible with the data available.
Judicious use of the eLLG for decision making in molecular replacement should reduce the time to structure solution in most cases. It should also guide the development of more efficient automated molecular-replacement pipelines, particularly those based on fragment libraries.