## research papers

## Likelihood-based docking of models into cryo-EM maps

^{a}Department of Haematology, Cambridge Institute for Medical Research, University of Cambridge, Hills Road, Cambridge CB2 0XY, United Kingdom, and ^{b}New Mexico Consortium, Los Alamos National Laboratory, 100 Entrada Drive, Los Alamos, NM 87544, USA^{*}Correspondence e-mail: rjr27@cam.ac.uk

Optimized docking of models into cryo-EM maps requires exploiting an understanding of the signal expected in the data to minimize the calculation time while maintaining sufficient signal. The likelihood-based rotation function used in crystallography can be employed to establish plausible orientations in a docking search. A phased likelihood translation function yields scores for the placement and rigid-body

of oriented models. Optimized strategies for choices of the resolution of data from the cryo-EM maps to use in the calculations and the size of search volumes are based on expected log-likelihood-gain scores computed in advance of the search calculation. Tests demonstrate that the new procedure is fast, robust and effective at placing models into even challenging cryo-EM maps.Keywords: likelihood; cryo-EM; docking; information gain.

### 1. Introduction

Advances in cryo-EM hardware and software are improving the resolution and quality of cryo-EM maps, often yielding maps that allow model building from scratch. Nevertheless, for various sample-specific and technical reasons, a substantial proportion of cryo-EM maps from single-particle reconstructions and a larger proportion of maps from subtomogram averaging lack the necessary resolution and quality for *ab initio* model building. In this situation, the density may be interpreted by docking one or more pre-existing experimental or predicted atomic models. We explore here the development of a new likelihood-based docking tool, *em_placement*, to fill this need. The accompanying paper (Read *et al.*, 2023) builds the theoretical background for the required likelihood targets and the statistical calculations underlying automation features of this software.

A large number of tools have been developed to carry out manual or automated docking. The automated tools include *DockEM* (Roseman, 2000; Titarenko & Roseman, 2021), *Situs* (Kovacs & Wriggers, 2002; Wriggers, 2012), *Powerfit* (Zundert & Bonvin, 2015), *OffGridFit* (Hoffmann *et al.*, 2017), *phenix.dock_in_map* (Liebschner *et al.*, 2019) and *MrBUMP* (Simpkin *et al.*, 2021). *DockEM*, *Powerfit* and *phenix.dock_in_map* all carry out an exhaustive exploration of orientations. *Situs* and *OffGridFit* both use six-dimensional (6D) fast Fourier transform (FFT)-based algorithms for an exhaustive 6D search. Among these, *MrBUMP* is unique in carrying out the translation search first with the spherically averaged phased translation function, followed by an orientation search centred on the point found in the translation search.

We have not attempted to carry out head-to-head comparisons of our software with existing tools for two reasons. Firstly, the half-maps needed for our approach are not generally available for the published test cases for existing tools. Secondly, we are not experts in the use of the other tools and would therefore not be able to show them to their best advantage.

### 2. Expected LLG-based search strategy

Docking problems can differ dramatically in their difficulty, from trivial cases in which distinctive features of the search model can be spotted by eye in an excellent map to extremely challenging cases where there is barely enough signal to recognize that a docked model agrees with a very noisy map. Great gains can be made in the efficiency and effectiveness of docking calculations by adopting a case-dependent strategy that is informed by considering the value of the log-likelihood-gain (LLG) score that would be expected for a correct solution given the quality of the data and the model, which we term the expected LLG or eLLG. The accompanying paper (Read *et al.*, 2023) provides the details of how these are defined and computed.

In crystallographic *et al.*, 2017; Oeffner *et al.*, 2018). In cryo-EM, after correcting for oversampling, our experiences in the tests below suggest that a similar threshold applies for identifying correct, or at least nonrandom, solutions. Given uncertainties about the sizes of coordinate errors prior to structure solution, trials of different choices in a database of MR problems showed that it is more efficient, overall, to choose strategy parameters expected to give a higher LLG score than 60, with 225 being a choice that works well to balance an increased initial search cost with a lower chance of having to rerun an unsuccessful search with modified parameters.

The pivotal decisions in the docking search strategy are determined by the rotation search, because it gives the lowest signal to noise; if this search is expected to succeed (or at least to give sufficient signal that a chosen subset of orientations is likely to include the correct orientation) then the subsequent translation search will be almost certain to succeed.

For this reason, the strategy decisions discussed below are primarily driven by considerations of the LLG signal expected in the rotation search, eLLG_{rot}. eLLG_{rot} depends on the resolution of the map and the fraction of the ordered volume in that map explained by the model. For a model covering only a small part of the ordered volume in a low-resolution map, the eLLG_{rot} signal can be very low. Fortunately, cryo-EM differs from crystallography in that the phase information allows a search to be confined to only a part of the full map, which reduces noise in the rotation search and increases eLLG_{rot}.

Given uncertainties in the eLLG calculations, particularly in estimating the quality of the search model in advance, searches that aim for the minimum required LLG have a significant chance of failure and it is safer to be somewhat more conservative so that fewer searches need to be repeated. For the rotation search, an LLG score of 30 or more is expected to correspond to a correct solution, as this is equivalent to the search score required for confidence in a crystallographic MR search in *P*1 (McCoy *et al.*, 2017). As a more conservative estimate, the initial target eLLG_{rot} is set to 60.

#### 2.1. Searching over the whole map with one rotation search

A major decision in the search strategy is whether a rotation search over the whole map is likely to succeed. For good maps and good models that comprise a sufficient fraction of the total structure, a strong signal will be expected in the rotation search. Searches can then be carried out over the whole map, but the efficiency is optimized by default in *em_placement* by limiting the resolution to that required to achieve eLLG_{rot} = 60. This is a value that will usually yield an unambiguous and precise orientation. In principle, an even lower resolution limit for Fourier terms could be used in the translation search, but in practice the translation search is computationally very efficient and only a few translation searches will be required if there is good signal in the rotation search. When it is not possible to achieve the ideal value of eLLG_{rot} = 60, lower levels of signal are still useful. Even if eLLG_{rot} = 7.5, the correct orientation is likely to be found in an orientation list of modest size, so carrying out a rotation search followed by a translation search over the entire map is only abandoned if eLLG_{rot} < 7.5.

#### 2.2. Searching over subvolumes

If it is not judged possible to search successfully for rotations using the full map, a decision is made whether it will be possible instead to find a solution by searching over subvolumes. The target subvolume is set according to the inverse relationship (Read *et al.*, 2023) between the size of the subvolume and the eLLG_{rot} that would be achieved by searching in that subvolume (if it contained the object being sought). As a simple example, if a value of 3.75 were found for eLLG_{rot} when computed over the whole map, the eLLG_{rot} for a map containing one-half of the total volume would be 7.5, which is the default for the minimal acceptable value. This calculation depends on the assumption that one of the subvolumes will contain the entire object being sought, so there is a lower limit to the smallest relevant subvolume. It is also implicitly assumed that the map quality in local regions is not much worse than the overall average map quality. This can lead to failures when the component being sought corresponds to a poor part of the map. Note that there is a practical limit to how small a subvolume can be; the number of overlapping subvolumes required to ensure that at least one of them contains the full volume of the model increases dramatically once the search volume is less than about 1.15 times the volume of the sphere enclosing the model. When the required search volume would be unfeasibly small, the brute-force search discussed below is invoked.

When a suitable size has been defined for the subvolumes (*i.e.* a size expected to achieve the minimal eLLG_{rot} of 7.5), target subvolumes for docking searches are constructed as follows. Firstly, a hexagonal close-packed grid is defined such that spheres with the target volume that are centred on the grid points will overlap sufficiently that at least one of the spheres is guaranteed to cover the volume containing the target object. Secondly, any spheres that lack sufficient ordered volume (defined as regions of the map with high local variance) to contain the search object are discarded. Following this, the spheres of density are analysed to evaluate signal and noise (to calibrate the likelihood targets) using the program *prepare_map_for_docking* described in the accompanying paper (Read *et al.*, 2023), and rotation and translation searches are then carried out followed by rigid-body To avoid Fourier artefacts from sharp boundaries in the map, the target sphere is cut out inside a cube that is large enough to allow a smooth masking of the density to the edges.

#### 2.3. Brute-force six-dimensional search

If the rotation search cannot be carried out with sufficient signal even with subvolumes, then the final fall-back in the search algorithm is to carry out a brute-force six-dimensional search. To make this search as efficient as possible, data are used only to the resolution required to obtain a value of eLLG_{tra} sufficient to yield a clear solution for the correct combined rotation and translation. Based on experiences with crystallographic MR, searches given an LLG of 60 should almost always be correct, but to be safe the target for eLLG_{tra} is set to 225, a value that has also been adopted for crystallographic MR in *Phaser* to give a good compromise between efficiency and the danger of missing the solution. Using the lowest resolution possible improves the efficiency by allowing orientations and translations to be sampled more coarsely and by reducing the number of Fourier terms over which the likelihood scores must be calculated. Even so, it is not uncommon for such a brute-force search to take hours to run.

#### 2.4. Focused docking

The final step in the docking strategy is to evaluate all potential docking solutions in a common framework. Docking poses that have an LLG score within some tolerance of the top score are retained as potential solutions. The tolerance is chosen by approximating the standard deviation of the LLG score as the square root of that score (McCoy *et al.*, 2017) and allowing potential solutions to deviate by as much as seven times this standard deviation. For each potential solution, the size of the sphere of density required to accommodate the entire search model is evaluated, a sphere of density of this size is cut out and analysis of the signal and noise is performed; a rigid-body is then carried out to obtain an LLG score, a final model placement and a map correlation with the processed density sphere. After the focused docking calculation, the list is pruned again based on the new top LLG score. By default, a maximum of five potential solutions are retained.

Two types of map coefficient (equations 18 and 19 in the accompanying paper; Read *et al.*, 2023) for the processed density sphere have been evaluated in the set of tests described. The first type (**F**_{map} = *D*_{obs}**E**_{mean}) should give a map that minimizes the error from the true sharpened map because it represents the expected value of the Fourier coefficient for such a map. The second type { = } includes an additional weighting term from the likelihood target and therefore gives a map for which the correlation to a sharpened map computed from the docked model should be roughly proportional to the likelihood score for that model. To compute the second map, a choice has to be made for the value of σ_{A}. The current default is to assign a value of 0.9, which would correspond to a model that accounts for about 80% of the scattering in the volume under consideration but has no other errors. The choice for σ_{A} could potentially be improved by considering deficiencies in the ability of atomic models to account for the bulk-solvent region. The second choice for map coefficients yielded higher map correlations than the first choice in the test calculations reported below. Qualitatively, the blurring that comes from giving higher weight to well determined (typically lower-resolution) Fourier terms seems to give more readily interpretable maps, in line with the map-correlation values. The second choice, therefore, is the default and was used for the map-correlation calculations reported below.

### 3. Methods

#### 3.1. Target selection

A set of single-particle cryo-EM structures was chosen that would convey a representative sample of experimental reconstructions covering a wide range of nominal resolutions (*d*_{min}) from 1.7 to 8.5 Å and of symmetry conditions (1–24 symmetric copies). The test cases were restricted to Data Bank (EMDB; Lawson *et al.*, 2016) entries for which half-maps had been deposited. Table 1 shows a summary of the selected test cases.

‡Number of symmetry-related copies (or pseudo-symmetric copies if in parentheses). §Models from PDB entries are defined in terms of the PDB identifier, optionally followed by a chain identifier and/or a range of residue numbers. AF indicates an AlphaFold model. ¶Fraction of the entire ordered volume explained by one copy of the model. ††Structure of Get3 in the open conformation. |

#### 3.2. Model selection

Models were selected to cover a variety of scenarios. Some models correspond to what could be called `reference' models, in the sense that they are the deposited models associated with the EMDB entry; these provide a reference docking with nearly zero rotation or translation. Others correspond to crystal structures of the same protein. Finally, we have tested some predicted models produced by *AlphaFold* (AF; Jumper *et al.*, 2021); such models will be used frequently, so understanding how they should be treated and how they will perform in our algorithm is essential. In all cases, we processed the predicted models with the *process_predicted_model* tool (Oeffner *et al.*, 2022), which replaces the predicted values for the local distance difference test (Mariani *et al.*, 2013), or pLDDT values, in the *B*-factor field of the coordinate file with appropriate *B* factors to downweight the less confident parts of the model, as well as trimming off residues with a pLDDT value less than 70 (on a scale of 0–100).

To determine the effect of model completeness, as well as local map quality, we also tested the effect of using smaller pieces of the structural model (individual chains, domains or subdomains). The models are also summarized in Table 1.

### 4. Implementation of algorithms

The algorithms have been implemented as a combination of Python scripts and C++ code, both making substantial use of the *Computational Crystallography Toolbox* (*cctbx*; Grosse-Kunstleve *et al.*, 2002).

The framework for the docking search has been implemented in the Python program *em_placement*, which is part of the *Voyager* structural biology framework built on *phasertng* (McCoy *et al.*, 2021). Associated tools required to evaluate the map eLLG, map information gain, fast phased translation search and cryo-EM likelihood target have been added to *phasertng*, which already contained tools to compute the rotation function eLLG (McCoy *et al.*, 2017), fast searches and LLG rescoring for rotations (Storoni *et al.*, 2004), and phased rigid-body (Millán *et al.*, 2021).

Note that the symmetry of the reconstruction is not yet used to aid model placement in the current version of the program.

The *em_placement* program is controlled using a set of keywords in the phil syntax used by *Phenix* (Liebschner *et al.*, 2019). An example keyword script is given in Appendix *A*. Most keywords (map files, model file, composition of the reconstruction defined in terms of sequences of the components) will not usually be altered. The nominal resolution of the map is optional but recommended, and the author-defined value in the EMDB entry was used in all cases reported here. It would be appropriate to use either the FSC-derived overall resolution or the highest local resolution in the map. Since the nominal resolution is used as the high-resolution limit for all of the calculations, if the map actually contains valid higher resolution features than the user-entered value some signal will be lost. If the user-entered value extends beyond the real resolution limit, CPU time is wasted but the search results should not be degraded unless the nominal resolution is very over-optimistic. The only parameter that might usefully be varied by the user is the equivalent r.m.s. error that defines the expected model quality. For the test cases, a value of 0.8 Å was used for models obtained from experimental structures of the same protein, a value of 1.0 Å was used for models predicted by *AlphaFold* and a value of 1.0 Å was used for the one experimental structure that differs somewhat in sequence, *i.e.* a model of apoferritin derived from a structure that contains the helix deleted in the target structure. Note that the estimate of the equivalent r.m.s. error is refined as part of the rigid-body so as long as the same solutions are found in the search the final result is the same.

The data used for test calculations are all available through the EMDB (Lawson *et al.*, 2016). Cryo-EM and crystallographic models are available from the worldwide Protein Data Bank (Berman *et al.*, 2007), except for the AF models, which were computed using the community *ColabFold* version (Mirdita *et al.*, 2022) of *AlphaFold* (Jumper *et al.*, 2021).

### 5. Results

#### 5.1. Docking results

The results of the docking trials are summarized in Table 2. The majority of the searches succeeded, and many of these required only a single search over the entire reconstruction. The time required for the searches ranged from half a minute to about 32 min, averaging about 12 min over the set of test cases. When multiple spherical subvolumes were searched, the number varied from four to 214. None of the test cases triggered the fall-back of carrying out a brute-force six-dimensional search.

‡Number of copies placed correctly (or incorrectly in parentheses). §MapCC is the map between the weighted map and a map computed from the model. Scores for incorrectly placed copies are in parentheses. ¶Linux workstation with a 3.8 GHz Intel Core i7-9800X CPU with eight cores but running primarily on a single thread. |

##### 5.1.1. GABA receptor

The highest resolution (1.7 Å) cryo-EM structure in our test set is that of the human γ-aminobutyric acid receptor bound to a megabody: PDB entry 7a5v, EMDB entry EMD-11657 (Nakane *et al.*, 2020).

To provide a reasonable challenge at such high resolution, only small models were tested, each comprising about 1/20 of the full pentamer or 1/4 of a single copy. The membrane domain is well ordered and is easy to place when using the membrane component of a single subunit of a 4cof (Miller & Aricescu, 2014), as a model. However, an *AlphaFold* model of the bound megabody is more difficult to place, as the associated density is the least well ordered in the map. Only two of the five copies were placed successfully, in spite of the fivefold symmetry of the reconstruction. If the subvolume spheres were placed in a way that obeyed the fivefold symmetry, the same results would be obtained for each copy. The sensitivity of the search to the boundaries of the subvolumes is an indicator that this is a marginal model for searching in this map. In principle, the missing copies could be generated by application of the fivefold symmetry.

##### 5.1.2. β-Galactosidase

β-Galactosidase is commonly used as a test object for cryo-EM methodology, as it is well behaved and possesses *D*2 tetrameric symmetry. We chose a medium-resolution (2.2 Å) representative: PDB entry 5a1a, EMDB entry EMD-2984 (Bartesaghi *et al.*, 2015).

Docking a full chain, either from the associated PDB entry or from a 1jz7 (Juers *et al.*, 2001), is straightforward to achieve by searching over the full map. On the other hand, docking just the β-barrel domain of one subunit is substantially more challenging and the map is divided into five subvolumes. All four copies were successfully found, although this success is fragile. A parallel run under MacOS found only three copies using the same computer code, presumably because of the effects of minor numerical differences. Again, the missing copy in that case could have been recovered by exploiting the symmetry of the map.

##### 5.1.3. Apoferritin

Because of its stability and high octahedral (432) symmetry, apoferritin is another very common test object for cryo-EM. We chose a relatively low-resolution (3.0 Å) representative: PDB entry 5xb1, EMDB entry EMD-6714, a deletion mutant of the E-helix (Ahn *et al.*, 2018).

Searching with a single chain from the reference structure finds all 24 copies with strong signal in a search over the full volume; the default to keep a maximum of five potential solutions was overridden for this case. As a more challenging test, we based a search model on a single chain from PDB entry 2cei, the of a full-length version of apoferritin, removing the E-helix from the search model. Again, all 24 copies were found with strong (although slightly lower) signal. Note that much of the computing time in these two tests is expended on evaluating the map correlations for the 24 solutions.

##### 5.1.4. *Escherichia coli* respiratory complex I

The largest series of trials was carried out with the reconstruction for conformation 2 of *E. coli* respiratory complex I: PDB entry 7nyu, EMDB entry EMD-12654 (Kolata & Efremov, 2021). This reconstruction presents a variety of challenges, as the overall resolution (3.8 Å) is already relatively low but also varies substantially over the different subunits. Parts of the membrane domain are particularly poorly resolved; the local resolution of chain *L* is estimated by the authors as being in the range 9–11 Å. An additional challenge comes from the fact that three of the membrane-domain components (chains *L*, *M* and *N*) have related sequences and structures, with pairwise sequence identities of 25–26%. As a result, it is possible to place a model into the density for a related subunit, yielding a nonrandom LLG score above 60.

Models were taken either from the reference structure or from the crystallographic structure of the membrane domain: PDB entry 3rko (Efremov & Sazanov, 2011). Searching for the entire membrane domain gives a clear solution using the full reconstruction. In searches for individual chains, such as the three related membrane-domain components, the reconstruction is automatically divided into subvolumes. For the best-ordered of the three related subunits, chain *N*, an unambiguous solution is found. Chain *M* is more poorly ordered and two potential solutions are found. The solution with higher scores is correctly placed, while the second solution superimposes the chain *M* model on the better-ordered density of chain *N*. Chain *L* is the least well ordered, and the search places the model on the density for either chain *N* or chain *M*, but not on the correct density that corresponds to chain *L*. Fig. 1(*a*) illustrates one of the most difficult successful results, showing the docked model of chain *M*.

##### 5.1.5. DNA mismatch-repair protein MutS

As a representative of a low-resolution (6.9 Å) reconstruction, we chose the *E. coli* DNA mismatch-repair protein MutS in its mismatch-bound state: PDB entry 7ai6, EMDB entry EMD-11792. In this bound state, the protein is a pseudosymmetric dimer, so there are two independent copies to find.

To test a workflow in which individual domains are docked in order to approximate a conformational change, we used as models the N- and C-terminal domains of one chain of MutS in the DNA-free conformation from PDB entry 6i5f (Bhairosing-Kok *et al.*, 2019). For such small fractions of the full structure at such low resolution, the signal in the rotation search would be extremely low, so the subvolume-determination algorithm chose to carry out the searches with multiple subvolumes of the maps. For the (larger) C-terminal domain, 26 spherical subvolumes were chosen. Although this is a relatively large number, each calculation is fast with low-resolution data, and an unambiguous docking of both copies was achieved in less than 4 min. For the (smaller) N-terminal domain, 214 subvolumes were chosen. The search in this case took significantly longer, at about 17 min, and failed to find the correct placements. None of the LLG values exceeded 50.

##### 5.1.6. Cystic fibrosis transmembrane regulator, ΔF508 mutant

The ΔF508 mutant of the cystic fibrosis transmembrane regulator (CFTR), with bound folding modulators, was chosen as a second low-resolution (6.9 Å) reconstruction: PDB entry 8ej1, EMDB entry EMD-28172 (Fiedorczuk & Chen, 2022).

Rather than testing other experimental structures of the same protein, we chose to make *AlphaFold* (Jumper *et al.*, 2021) models in the *ColabFold* environment (Mirdita *et al.*, 2022). Although structures of CFTR would have been present in the training data for *AlphaFold*, their influence was reduced by turning off the option to include explicit templates of related structure in the structure-prediction process. As for the case of MutS, the difficulty of the docking calculations was increased by extracting models of individual domains from the full predicted structure. As expected, it was more difficult to place smaller models. The membrane domain, the largest in the processed model with 585 residues, was placed easily (LLG = 704) in a search over the entire reconstruction. A mid-sized domain comprising 214 residues gave two potential solutions with LLG values of 297 and 188 but searching over ten subvolumes and taking nearly seven times as long. The first potential solution was placed correctly, but the second was superimposed on the smallest domain, which has a similar fold and a sequence identity of 27% over 168 matched residues (of 187 in the smaller of the two domains). Similarly, a search for the smallest domain gave two potential solutions with LLG scores of 220 for the correct solution (Fig. 1*b*) and 134 for a superposition on the mid-sized domain.

##### 5.1.7. Get3, closed conformation

The lowest resolution (8.46 Å) map in the test set is of the closed conformation of the ER targeting factor Get3: EMDB entry EMD-25375 (Fry *et al.*, 2022). The authors did not deposit coordinates in the PDB for this reconstruction, presumably because it had the lowest resolution of a series of maps. Therefore, it makes a good example for the circumstance in which a structural biologist would like to examine a published map in the context of a docked model from another structure.

We chose the 7spz (Fry *et al.*, 2022), as the model. The reconstruction is pseudosymmetric, so there are two independent copies to find. Both of them can be found in a straightforward search over the full reconstruction that takes only about half a minute.

#### 5.2. Checking the eLLG_{rot}-guided subvolume criterion

In the global search, the eLLG_{rot} criterion suggested that a single search sphere covering the entire ordered volume of the reconstruction would give sufficient rotation-function signal for eight of the 18 test cases. Validating this, all eight of these searches succeeded (Table 2). However, if the eLLG_{rot} criterion were too pessimistic about the ability to find the model in the whole map, the eight searches that found at least one copy when searching over subvolumes might have succeeded with a global search. To test this, we used a manual override in the *em_placement* program to force a search over a single sphere covering the entire ordered volume. Because the two models for chain *M* of *E. coli* respiratory complex I are very similar, we only tested chain *M* from the in PDB entry 3rko. The results are given in Table 3.

‡Number of copies placed correctly (or incorrectly in parentheses). §Scores for incorrectly placed copies are in parentheses. ¶Linux workstation with a 3.8 GHz Intel Core i7-9800X CPU with 16 cores but running primarily on a single thread. |

The results support the eLLG_{rot} criterion as an effective guide to search strategy. No correct solution is found for four of the seven test cases, and only one of two solutions is found when searching for the C-terminal domain of MutS. The only cases where the criterion was clearly too pessimistic about the ability to find the model in the whole map are the searches for the mid-sized and smallest domains of the *AlphaFold* model for the ΔF508 mutant of CFTR. Here the correct solutions are found in about 2 min each in the whole map, whereas the global subvolume searches took 11–13 min (Table 2). However, the forced global search failed to find the nonrandom solutions mentioned in Section 5.1.4 in which the two homologous domains were placed in positions belonging to each other.

#### 5.3. Tests of brute-force six-dimensional searches

The two cases where the global search failed, as well as the MutS case in which 26 subvolumes were explored, provided tests of the brute-force 6D fall-back algorithm. These were carried out to examine whether the global 6D search could succeed for cases where rotation searches for the smallest practical subvolume would have insufficient signal, and also how it compares in efficiency with searching over a large number of subvolumes.

##### 5.3.1. Chain *L* of *E. coli* respiratory complex I

The brute-force 6D search fails to find the correct position of chain *L*, but does reproduce the results of the automated search using multiple subvolumes as the model for chain *L* is superimposed on the map regions for chains *M* and *N*. The run time is dramatically longer at approximately 11 h, compared with about 10 min for the automated search with multiple sub-volumes.

##### 5.3.2. N-terminal domain of MutS

The brute-force search is more successful than the adaptive search over 214 subvolumes, as one of the two copies of this domain is found with an LLG of 83 and a map correlation of 0.536. Although the next best potential solution has an LLG of 60, none of the other potential solutions are correct. This partial success comes at the cost of about 44 min of running time. This is the only test case in which triggering the fall-back to a brute-force 6D search would have been justified.

##### 5.3.3. C-terminal domain of MutS

Both copies of the C-terminal domain of PDB entry 6i5f are found in the brute-force 6D search with the same scores. However, the search using 26 subvolume spheres is dramatically more efficient, taking about 4 min compared with 139 min for the brute-force 6D search.

### 6. Discussion and conclusions

The strength of likelihood as a criterion is supported by the success of our new likelihood-based approach to docking models in a series of progressively more challenging cryo-EM maps. Since the successful application of likelihood to a problem requires a good model of the sources of error and their propagation, these results also support our approach to defining and calibrating likelihood targets for cryo-EM data in the accompanying paper (Read *et al.*, 2023).

The outcomes of different search strategies can be predicted by an analysis of the expected log-likelihood-gain (eLLG) score for both the rotation-search and translation-search components of the docking algorithm. The rotation-function eLLG can be used to predict how large a volume of the map can be explored in one rotation search, allowing automated decisions about the subdivision of the full map into spherical subvolumes. The choices made by this criterion have been validated by comparing the success of searches over the full map with those carried out over the suggested subvolumes.

Docking models into the most poorly ordered part of a map is difficult, partly because of the reduced signal to noise but also because the assessment of global map quality can mislead the algorithm determining the search strategy into choices that provide insufficient signal in the worst regions of the map. This could potentially be mitigated by adapting the strategic choices to local levels of signal to noise in the reconstruction.

Plans for future enhancements include accounting for symmetry in the search space, which will be significantly more efficient in the case of high-symmetry reconstructions such as those for apoferritin. Calculations could be made faster by using parallel processing, particularly for searches over multiple subvolumes. Searches for multiple components will be implemented, which requires accounting for the contribution of previously placed components in the fit to the experimental data, as well as avoiding clashes between components.

### APPENDIX A

### Example script for *em_placement*

The following script defines the search parameters for the *em_placement* script used to run the first test case, docking the membrane component of a model of the GABA receptor derived from PDB entry 4cof into the cryo-EM reconstruction deposited as EMDB entry EMD-11657:

Using a recent *Phenix* version (the dev-4887 development release or newer), this script can be run with the command `phenix.voyager.em_placement docking_script.phil`.

Most parameters specified in the script have been named in a way intended to convey the purpose of the parameter. The `point_group_symmetry` feature is only used at the moment to optionally generate a full assembly from a single copy. The `sequence_composition` parameter specifies the name of a file containing the sequences of all the components in the reconstruction.

### Footnotes

‡Present address: SciBite Limited, BioData Innovation Centre, Wellcome Genome Campus, Hinxton, Cambridge CB10 1DR, United Kingdom.

### Acknowledgements

We thank Cathy Lawson for implementing a method to allow searches at the EMDataResource for EMDB entries providing half-maps.

### Funding information

The following funding is acknowledged: Wellcome Trust (grant No. 209407/Z/17/Z to Randy J. Read); National Institutes of Health, National Institute of General Medical Sciences (grant No. P01GM063210 to Randy J. Read, Thomas C. Terwilliger).

### References

Ahn, B., Lee, S.-G., Yoon, H. R., Lee, J. M., Oh, H. J., Kim, H. M. & Jung, Y. (2018). *Angew. Chem. Int. Ed.* **57**, 2909–2913. Web of Science CrossRef CAS Google Scholar

Bartesaghi, A., Merk, A., Banerjee, S., Matthies, D., Wu, X., Milne, J. L. S. & Subramaniam, S. (2015). *Science*, **348**, 1147–1151. Web of Science CrossRef CAS PubMed Google Scholar

Berman, H., Henrick, K., Nakamura, H. & Markley, J. L. (2007). *Nucleic Acids Res.* **35**, D301–D303. Web of Science CrossRef PubMed CAS Google Scholar

Bhairosing-Kok, D., Groothuizen, F. S., Fish, A., Dharadhar, S., Winterwerp, H. H. K. & Sixma, T. K. (2019). *Nucleic Acids Res.* **47**, 8888–8898. Web of Science CAS PubMed Google Scholar

Efremov, R. G. & Sazanov, L. A. (2011). *Nature*, **476**, 414–420. Web of Science CrossRef CAS PubMed Google Scholar

Fiedorczuk, K. & Chen, J. (2022). *Science*, **378**, 284–290. CrossRef CAS PubMed Google Scholar

Fry, M. Y., Najdrová, V., Maggiolo, A. O., Saladi, S. M., Doležal, P. & Clemons, W. M. (2022). *Nat. Struct. Mol. Biol.* **29**, 820–830. Web of Science CrossRef CAS PubMed Google Scholar

Goddard, T. D., Huang, C. C., Meng, E. C., Pettersen, E. F., Couch, G. S., Morris, J. H. & Ferrin, T. E. (2018). *Protein Sci.* **27**, 14–25. Web of Science CrossRef CAS PubMed Google Scholar

Grosse-Kunstleve, R. W., Sauter, N. K., Moriarty, N. W. & Adams, P. D. (2002). *J. Appl. Cryst.* **35**, 126–136. Web of Science CrossRef CAS IUCr Journals Google Scholar

Hoffmann, A., Perrier, V. & Grudinin, S. (2017). *J. Appl. Cryst.* **50**, 1036–1047. Web of Science CrossRef CAS IUCr Journals Google Scholar

Juers, D. H., Heightman, T. D., Vasella, A., McCarter, J. D., Mackenzie, L., Withers, S. G. & Matthews, B. W. (2001). *Biochemistry*, **40**, 14781–14794. Web of Science CrossRef PubMed CAS Google Scholar

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., Petersen, S., Reiman, D., Clancy, E., Zielinski, M., Steinegger, M., Pacholska, M., Berghammer, T., Bodenstein, S., Silver, D., Vinyals, O., Senior, A. W., Kavukcuoglu, K., Kohli, P. & Hassabis, D. (2021). *Nature*, **596**, 583–589. Web of Science CrossRef CAS PubMed Google Scholar

Kolata, P. & Efremov, R. G. (2021). *eLife*, **10**, e68710. Web of Science CrossRef PubMed Google Scholar

Kovacs, J. A. & Wriggers, W. (2002). *Acta Cryst.* D**58**, 1282–1286. Web of Science CrossRef CAS IUCr Journals Google Scholar

Lawson, C. L., Patwardhan, A., Baker, M. L., Hryc, C., Garcia, E. S., Hudson, B. P., Lagerstedt, I., Ludtke, S. J., Pintilie, G., Sala, R., Westbrook, J. D., Berman, H. M., Kleywegt, G. J. & Chiu, W. (2016). *Nucleic Acids Res.* **44**, D396–D403. Web of Science CrossRef CAS PubMed Google Scholar

Liebschner, D., Afonine, P. V., Baker, M. L., Bunkóczi, G., Chen, V. B., Croll, T. I., Hintze, B., Hung, L.-W., Jain, S., McCoy, A. J., Moriarty, N. W., Oeffner, R. D., Poon, B. K., Prisant, M. G., Read, R. J., Richardson, J. S., Richardson, D. C., Sammito, M. D., Sobolev, O. V., Stockwell, D. H., Terwilliger, T. C., Urzhumtsev, A. G., Videau, L. L., Williams, C. J. & Adams, P. D. (2019). *Acta Cryst.* D**75**, 861–877. Web of Science CrossRef IUCr Journals Google Scholar

Mariani, V., Biasini, M., Barbato, A. & Schwede, T. (2013). *Bioinformatics*, **29**, 2722–2728. Web of Science CrossRef CAS PubMed Google Scholar

McCoy, A. J., Oeffner, R. D., Wrobel, A. G., Ojala, J. R. M., Tryggvason, K., Lohkamp, B. & Read, R. J. (2017). *Proc. Natl Acad. Sci. USA*, **114**, 3637–3641. Web of Science CrossRef CAS PubMed Google Scholar

McCoy, A. J., Stockwell, D. H., Sammito, M. D., Oeffner, R. D., Hatti, K. S., Croll, T. I. & Read, R. J. (2021). *Acta Cryst.* D**77**, 1–10. Web of Science CrossRef IUCr Journals Google Scholar

Millán, C., Keegan, R. M., Pereira, J., Sammito, M. D., Simpkin, A. J., McCoy, A. J., Lupas, A. N., Hartmann, M. D., Rigden, D. J. & Read, R. J. (2021). *Proteins*, **89**, 1752–1769. Web of Science PubMed Google Scholar

Miller, P. S. & Aricescu, A. R. (2014). *Nature*, **512**, 270–275. Web of Science CrossRef CAS PubMed Google Scholar

Mirdita, M., Schütze, K., Moriwaki, Y., Heo, L., Ovchinnikov, S. & Steinegger, M. (2022). *Nat. Methods*, **19**, 679–682. Web of Science CrossRef CAS PubMed Google Scholar

Nakane, T., Kotecha, A., Sente, A., McMullan, G., Masiulis, S., Brown, P. M. G. E., Grigoras, I. T., Malinauskaite, L., Malinauskas, T., Miehling, J., Uchański, T., Yu, L., Karia, D., Pechnikova, E. V., de Jong, E., Keizer, J., Bischoff, M., McCormack, J., Tiemeijer, P., Hardwick, S. W., Chirgadze, D. Y., Murshudov, G., Aricescu, A. R. & Scheres, S. H. W. (2020). *Nature*, **587**, 152–156. Web of Science CrossRef CAS PubMed Google Scholar

Oeffner, R. D., Afonine, P. V., Millán, C., Sammito, M., Usón, I., Read, R. J. & McCoy, A. J. (2018). *Acta Cryst.* D**74**, 245–255. Web of Science CrossRef IUCr Journals Google Scholar

Oeffner, R. D., Croll, T. I., Millán, C., Poon, B. K., Schlicksup, C. J., Read, R. J. & Terwilliger, T. C. (2022). *Acta Cryst.* D**78**, 1303–1314. Web of Science CrossRef IUCr Journals Google Scholar

Read, R. J., Millan, C., McCoy, A. J. & Terwilliger, T. C. (2023). *Acta Cryst.* D**79**, https://doi.org/10.1107/S2059798323001596. CrossRef IUCr Journals Google Scholar

Roseman, A. M. (2000). *Acta Cryst.* D**56**, 1332–1340. Web of Science CrossRef CAS IUCr Journals Google Scholar

Simpkin, A. J., Winn, M. D., Rigden, D. J. & Keegan, R. M. (2021). *Acta Cryst.* D**77**, 1378–1385. Web of Science CrossRef IUCr Journals Google Scholar

Storoni, L. C., McCoy, A. J. & Read, R. J. (2004). *Acta Cryst.* D**60**, 432–438. Web of Science CrossRef CAS IUCr Journals Google Scholar

Titarenko, V. & Roseman, A. M. (2021). *Acta Cryst.* D**77**, 447–456. Web of Science CrossRef IUCr Journals Google Scholar

Wriggers, W. (2012). *Acta Cryst.* D**68**, 344–351. Web of Science CrossRef IUCr Journals Google Scholar

Zundert, G. C. P. van & Bonvin, A. M. J. J. (2015). *AIMS Biophys.* **2**, 73–87. Google Scholar

This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.