research papers
Current approaches for the fitting and refinement of atomic models into cryoEM maps using CCPEM
^{a}Structural Studies, MRC Laboratory of Molecular Biology, Francis Crick Avenue, Cambridge CB2 0QH, England
^{*}Correspondence email: garib@mrclmb.cam.ac.uk
Recent advances in instrumentation and software have resulted in cryoEM rapidly becoming the method of choice for structural biologists, especially for those studying the three-dimensional structures of very large macromolecular complexes. In this contribution, the tools for macromolecular structure refinement into cryoEM reconstructions that are available via CCPEM are reviewed, specifically focusing on REFMAC5 and related tools. Whilst originally designed with a view to refinement against X-ray diffraction data, some of these tools have been repurposed for cryoEM, because the same principles apply to refinement against cryoEM maps. Since both techniques are used to elucidate macromolecular structures, tools encapsulating prior knowledge about macromolecules can easily be transferred. However, there are some significant qualitative differences that must be acknowledged and accounted for; relevant differences between these techniques are highlighted. The importance of phases is considered and the potential utility of replacing inaccurate amplitudes with their expectations is justified. More pragmatically, an upper bound on the correlation between observed and calculated Fourier coefficients, expressed in terms of the Fourier shell correlation between half-maps, is demonstrated. The importance of selecting appropriate levels of map blurring/sharpening is emphasized, which may be facilitated by considering the behaviour of the average map amplitude at different resolutions, as well as the utility of simultaneously viewing multiple blurred/sharpened maps. Features that are important for the purposes of computational efficiency are discussed, notably the Divide and Conquer pipeline for the parallel refinement of large macromolecular complexes. Techniques that have recently been developed or improved in Coot to facilitate and expedite the building, fitting and refinement of atomic models into cryoEM maps are summarized. Finally, a tool for symmetry identification from a given map or coordinate set, ProSHADE, which can identify the point group of a map and thus may be used during deposition as well as during molecular visualization, is introduced.
Keywords: REFMAC5; cryoEM; model refinement; map blurring; map sharpening; Divide and Conquer; ProSHADE; symmetry detection.
1. Introduction
Macromolecular X-ray crystallography (MX), nuclear magnetic resonance (NMR) and cryo-electron microscopy (cryoEM) are the three main experimental techniques that are used to elucidate macromolecular structures in order to answer biological questions. At present, the majority of the structural models deposited in the Protein Data Bank (PDB; Berman et al., 2002) have been derived using MX (>120 000 models), an order of magnitude more than the second most commonly used technique, NMR (>12 000 ensembles). Although the current proportion of models derived using cryoEM is comparatively small (>2000), it is becoming the tool of choice owing to the so-called `resolution revolution' caused by rapid advances in instrumentation and software (Faruqi & McMullan, 2011; Lyumkis et al., 2013; Kühlbrandt, 2014; Scheres, 2014).
Whilst the purpose of these experimental techniques is to answer particular biological questions, our aim is to facilitate this using all available structural information; the purpose of computational tools is to extract as much information as possible from a given data set. Since the information contained in noisy and limited data can be hard to extract, we must develop mathematical and computational tools to help to maximize information extraction in such challenging cases. Of course, there are no computational tools that can replace carefully designed experiments; computation can only aid in experimental design and help to increase the amount of information extracted from the data.
The resolutions quoted for cryoEM reconstructions vary greatly, and there is a difference in the way in which maps are modelled at different resolutions. If sufficiently high-quality data are available it is now possible to consider de novo model building and full atomic refinement (e.g. above ∼4 Å, using currently available technology and the current definition of resolution; Rosenthal & Henderson, 2003). However, at lower resolutions the limited number of observations means that additional prior information, in the form of precomputed atomic models, may be required. Such initial models would be fitted/morphed into blobs of density. CCPEM (Wood et al., 2015) contains various tools to facilitate model fitting/refinement, including DockEM (Roseman, 2000), Choyce (Rawi et al., 2010) and FlexEM (Joseph et al., 2016) for fitting and morphing of the structure at low resolution in cases where only information about the overall shape of the molecule is available, and REFMAC5 (Murshudov et al., 2011) for full atomic model refinement at higher resolutions where at least some bulky side chains are visible. It should be noted that other software tools are available for the fitting and refinement of atomic models into cryoEM maps, including DireX (Schröder et al., 2007), MDFF (Trabuco et al., 2008), CryoFit (Kirmizialtin et al., 2015), Rosetta (Wang et al., 2016) and phenix.real_space_refine (Afonine et al., 2018).
Experimental data alone are typically insufficient to successfully build and refine an atomic model. Fortunately, our interpretation of incomplete and noisy experimental observations can be improved by using additional sources of information: the stereochemistry of constituent blocks of macromolecules, typical secondary-structure patterns, structures of related macromolecular domains, structural data obtained using different experimental methods etc. Information derived from different experiments can be simultaneously co-utilized in order to better address biological questions (Trabuco et al., 2009; Gong et al., 2015; Kovalevskiy et al., 2018). Ideally, staying within realistic and computationally tractable bounds, all available sources of experimental and theoretical information relevant to the molecule of interest should be integrated into one process, with the intention of delivering the best possible structural model for a given state of the molecule.
In both MX and cryoEM we do not refine against the original data observed in the experiment (in MX the raw data are diffraction-pattern images, while in single-particle cryoEM the data are two-dimensional particles). Rather, we refine against derived data, effectively treating them as if they were experimental observations. Despite the inherent loss of information, this should give reasonable results provided that error estimates are sufficiently accurate.
In MX the `observations' are taken to be the estimated diffraction-spot intensities, which are often converted to structure-factor amplitudes during pre-refinement data processing. Consequently, one needs to solve the `phase problem' either by using prior knowledge about a related structure, i.e. molecular replacement (McCoy et al., 2007; Vagin & Teplyakov, 2010), or by carrying out additional experiments and determining phases for a substructure (SAD, MAD, SIR etc.; Sheldrick, 2015; Skubák & Pannu, 2013). Often macromolecules, especially large complexes of exceptional biological interest, fail to form high-quality crystals, resulting in poor diffraction. In such cases, the resulting electron-density maps can be hard to interpret, and additional sources of structural information are often useful when interpreting such poor-quality maps. Indeed, there is an emphasis on the importance of improving phases and the resulting electron-density maps in MX refinement.
This contrasts with atomic model refinement in cryoEM, in which the `observations' are taken to be the electrostatic potential maps: the outputs of the three-dimensional reconstruction process, which do contain phase information. To date, there have been limited attempts to improve the input maps using information about the current state of the atomic model during refinement (i.e. co-refinement of atomic model and three-dimensional reconstructions), although such attempts do include density sharpening by LocScale (Jakobi et al., 2017). Local map quality, i.e. local resolution, varies greatly within one and between several reconstructions (Kucukelbir et al., 2014). Again, the use of additional prior knowledge can help with interpreting the lower resolution parts of such maps that exhibit varying signal-to-noise ratios.
Since the atomic models in both MX and cryoEM correspond to macromolecules, the prior knowledge used in both of these techniques is essentially the same (Brown et al., 2015). However, there are qualitative differences between these techniques that affect how model refinement is approached, and the types of problems that are typically encountered include the following.
Indeed, one major consequence of the ability/inability to observe phase information is in how the maps are calculated. In MX the electron-density maps viewed are calculated using phase information from the current state of the model (e.g. 2mF_{o} − DF_{c} maps), which means that they inherently suffer from model bias. As a result of this, they also have to be updated/recalculated after each round of refinement. This means that the interpretation of these density maps can vary dramatically during the refinement process, particularly at lower resolutions where the data are limited and noisy. In contrast, standard cryoEM maps are, at present, not calculated using phase information from the current state of the model; they are not updated/recalculated after each round of refinement. Consequently, cryoEM atomic model refinement can be considered as the problem of fitting into the map whilst ensuring consistency with prior information, ensuring chemical and structural integrity of the model according to our current knowledge of macromolecular structures.
In this contribution, we review the tools for macromolecular structure refinement into cryoEM reconstructions that are available via CCPEM (Burnley et al., 2017). Specifically, we focus on the program REFMAC5 (Murshudov et al., 2011), noting that other software and suites employ similar technologies (see, for example, Afonine et al., 2013). These tools were originally designed with a view to refinement against MX diffraction data, but the same principles are applicable to refinement against cryoEM maps (Murshudov, 2016; Brown et al., 2015). The importance of phases is considered, as well as the potential utility of replacing inaccurate amplitudes with their expected values. Issues that should be contemplated when performing refinement against cryoEM reconstructions are emphasized, notably an upper bound on the correlation between the atomic model and observed map, based on the Fourier shell correlation between half-maps, and how care should be taken when choosing which level of blurred/sharpened map to refine the model against. Features important for the purposes of computational efficiency are discussed: the necessity for appropriate box-size selection, and the Divide and Conquer pipeline, which makes the refinement of large complexes computationally tractable by refining different parts of the model in parallel. For completeness, other relevant tools to aid atomic model refinement in REFMAC5 and Coot (Emsley et al., 2010) are discussed, notably the use of prior information and tools with a wider radius of convergence to facilitate the refinement of highly displaced regions of a model. Finally, a new tool for the detection of rotational symmetries in atomic models and maps, ProSHADE, is presented. It should be emphasized that the recommendations and features described in this contribution relate to the current state of existing software tools; in future it would be advantageous for improved techniques and tools to be developed and implemented.
2. The importance of phases
It is well known that for the purpose of map calculation the phases of structure factors are more important than their amplitudes. To analyse this statement, we can consider the correlation between the current and `ideal' maps. Specifically, since correlations calculated in real and reciprocal space are equivalent, we consider the Fourier shell correlation (FSC) calculated over all structure factors,
FSC = cor(ρ_{C}, ρ_{t}) = Σ |F_{C}||F_{t}| cos(φ_{C} − φ_{t}) / [Σ |F_{C}|^{2} Σ |F_{t}|^{2}]^{1/2},
where a subscript C denotes the current map and a subscript t denotes the `true' (or `ideal') map, and ρ represents map density with corresponding structure factors F with amplitudes |F| and phases φ. If we consider structure factors in a narrow resolution range then we can express the FSC in terms of the normalized amplitudes E. Then, under the assumptions that the reciprocal-space points are sufficiently dense and that the distribution of Fourier coefficients in shells reflects the `true' distribution, we can express the FSC as the expected value of the weighted cosine of phase differences,
FSC = 〈E_{C}E_{t} cos(φ_{C} − φ_{t})〉.
It is clear that for `good' maps the FSC would be higher; if we had two maps, and were able to calculate the FSC between these maps and the corresponding `true' map, then we would prefer the one that exhibited the higher FSC. An important, if perhaps obvious, point to note is that if the phases are random and the amplitudes are exact then the FSC will be zero.
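Now consider the limiting case where the phases are exact and the amplitudes are random. Under the assumption that the amplitudes come from a Wilson distribution (acentric, and independent of the true amplitudes), we have
FSC = 〈E_{C}E_{t}〉 = 〈E_{C}〉〈E_{t}〉 = π/4 ≃ 0.785.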
However, if we replace all amplitudes by their expected value (in a given bin), i.e. E_{C} = 〈E_{C}〉, then we instead obtain
FSC = 〈E_{t}〉 = π^{1/2}/2 ≃ 0.886.
Consequently, given that this value is greater than 0.785, if the structure factors have little or no information about the structure under analysis then it might be better to replace the observed amplitudes with their expected values (for further discussion in the context of MX, see Nicholls et al., 2017).
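The two limiting cases discussed above can be checked numerically. The following is a minimal sketch, not part of any CCPEM tool: it assumes acentric Wilson statistics (so that |E|^{2} is exponentially distributed with unit mean), exact phases, and current amplitudes that are statistically independent of the true ones. The function name `map_correlation` and the sample size are illustrative choices.

```python
import numpy as np

def map_correlation(e_current, e_true):
    """Correlation between two maps that share identical phases.

    With exact phases cos(phi_C - phi_t) = 1, so the correlation reduces to
    sum(E_C * E_t) / sqrt(sum(E_C^2) * sum(E_t^2)).
    """
    num = np.sum(e_current * e_true)
    den = np.sqrt(np.sum(e_current**2) * np.sum(e_true**2))
    return num / den

rng = np.random.default_rng(0)
n = 200_000

# Assumption: acentric Wilson statistics, i.e. |E|^2 ~ Exponential(mean = 1).
e_true = np.sqrt(rng.exponential(1.0, n))
e_random = np.sqrt(rng.exponential(1.0, n))  # amplitudes unrelated to the truth
e_expected = np.full(n, e_random.mean())     # every amplitude replaced by its mean

fsc_random = map_correlation(e_random, e_true)
fsc_expected = map_correlation(e_expected, e_true)

print(f"random amplitudes:   {fsc_random:.3f} (analytic pi/4 = {np.pi / 4:.3f})")
print(f"expected amplitudes: {fsc_expected:.3f} (analytic sqrt(pi)/2 = {np.sqrt(np.pi) / 2:.3f})")
```

Under these assumptions the simulated correlations should approach the analytic values as the number of sampled coefficients grows, with the second value exceeding the first.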
It must be emphasized that the above analysis is valid for map calculation only, i.e. not for use in model refinement. Indeed, for the refinement of model parameters we must have the conditional probability distribution of observed data given model parameters. It seems sensible to use all observations during refinement provided that errors are estimated accurately.
3. Correlation between atomic model and observed maps
In practice, we are not able to directly calculate the FSC between the current and `ideal' maps. However, we are able to calculate the correlation between the observed and calculated coefficients: cor(F_{o}, F_{c}). Furthermore, if half data sets are available then we are also able to calculate the FSC between the two half data sets: FSC_{1/2}. Therefore, we are able to consider the relationship between FSC_{1/2} and cor(F_{o}, F_{c}).
Let us assume that we have observations, and that the errors in the observations are additive: F_{o}(s) = F_{T}(s) + F_{n}(s), where a subscript o denotes observations, a subscript T denotes the `true' image and a subscript n denotes noise. Let us further assume that we have Fourier coefficients F_{c}(s) calculated from the atomic model. In the absence of overfitting we can also assume that there is no correlation between calculated Fourier coefficients and noise in the data. The correlation between observed and calculated Fourier coefficients in narrow Fourier shells can thus be calculated as
cor(F_{o}, F_{c}) = cor(F_{T}, F_{c}) [σ_{T}^{2}/(σ_{T}^{2} + σ_{n}^{2})]^{1/2},
where σ_{T}^{2} and σ_{n}^{2} denote the variances of the signal and of the noise within the shell.
If we have half data sets, then (see Appendix A)
σ_{T}^{2}/(σ_{T}^{2} + σ_{n}^{2}) = 2FSC_{1/2}/(1 + FSC_{1/2}),
resulting in
cor(F_{o}, F_{c}) = cor(F_{T}, F_{c}) [2FSC_{1/2}/(1 + FSC_{1/2})]^{1/2}.
This relationship only holds if there is no correlation between the signal F_{c}(s) and noise in the data. However, when the atomic model is refined against observed data then fitting, at least partially, into the noise is unavoidable. Consequently, we can use this relationship to infer an upper limit on the correlation between observed and calculated Fourier coefficients: it should never exceed [2FSC_{1/2}/(1 + FSC_{1/2})]^{1/2}. For example, if the FSC_{1/2} is 0.5 then the correlation between the observed and calculated Fourier coefficients should not exceed (2/3)^{1/2} ≃ 0.82. Were a correlation higher than this value observed then further investigation would be required.
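As an illustration of how this bound might be used as a per-shell sanity check, the sketch below evaluates [2FSC_{1/2}/(1 + FSC_{1/2})]^{1/2} and flags shells in which the model-to-map correlation exceeds it. The function names, the treatment of non-positive FSC_{1/2} values and the `tolerance` parameter are assumptions made for this example; they are not taken from REFMAC5 or CCPEM.

```python
import math

def max_model_map_correlation(fsc_half):
    """Upper limit on cor(F_o, F_c) in a shell, given the half-map FSC."""
    if fsc_half > 1.0:
        raise ValueError("fsc_half must not exceed 1")
    if fsc_half <= 0.0:
        # Assumption: a shell with no measurable signal supports no correlation.
        return 0.0
    return math.sqrt(2.0 * fsc_half / (1.0 + fsc_half))

def flag_suspicious_shells(fsc_half, cor_obs_calc, tolerance=0.0):
    """Indices of shells whose model-map correlation exceeds the bound."""
    return [
        i
        for i, (half, cor) in enumerate(zip(fsc_half, cor_obs_calc))
        if cor > max_model_map_correlation(half) + tolerance
    ]

# Worked example from the text: FSC_1/2 = 0.5 gives a limit of about 0.82.
print(round(max_model_map_correlation(0.5), 2))

# Hypothetical per-shell values; only the second shell exceeds its limit.
print(flag_suspicious_shells([0.9, 0.5, 0.143], [0.95, 0.90, 0.30]))
```

A flagged shell is not proof of overfitting, only a prompt for the further investigation recommended above.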
4. Blurring and sharpening
In cryoEM, variability of the reconstructed molecule owing to heterogeneity of the sample and computational inaccuracies during the reconstruction causes blurring of the signal in the map. Map sharpening has been used to counter overblurred maps, resulting in features in the map being revealed (Brunger et al., 2009; Nicholls et al., 2012), noting that other approaches towards map modification have been employed with a similar objective both in the context of MX (Afonine et al., 2015) and cryoEM (Jakobi et al., 2017; Terwilliger et al., 2018). Conversely, if the map has been oversharpened then blurring may be required. Reconstruction programs perform post-processing in order to deblur or sharpen the resultant map. However, even if the noise variance is constant within the reconstructed map, owing to the varying mobility of the molecule over space it can be expected that the signal-to-noise ratio will also vary over space. The deblurring parameter should depend on the signal-to-noise ratio, so a single parameter value may not be sufficient for all parts of the map. Consequently, different parts of the map may require different levels of sharpening/blurring for optimal interpretation, and thus may still need additional sharpening or blurring in order to achieve optimal results.
In MX, map calculation is usually performed as a separate step after refinement, so sharpening/blurring does not affect refinement. However, in cryoEM sharpening/blurring is performed after reconstruction but before model building, so it may directly affect the behaviour of refinement. Consequently, careful thought is required as to the appropriate level of sharpening/blurring in order to achieve optimal results. Furthermore, the appropriate levels of sharpening/blurring required for model building may differ from those required for refinement.
Refining against an overblurred map will have a negative effect on the atomic model, as it may increase the overall B value beyond reason. As a low-pass filter, map blurring reduces high-frequency noise but also suppresses finer structural details (for example side chains). Conversely, oversharpening is inadvisable: because it exacerbates high-frequency noise in the map and increases the series-termination effect, it may mask out the signal and result in an uninterpretable map. In the extreme, the visual distinction between protein and solvent regions would diminish. Consequently, the selection of appropriate levels of blurring/sharpening is important (Fig. 1).
Selecting appropriate blurring/sharpening B values may be facilitated by analysing how the average structure-factor amplitude varies with resolution: this should gradually decay with increasing resolution, yet if the map is oversharpened then it will instead increase (Fig. 2).
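The amplitude-profile check described above can be sketched with NumPy. This is an illustrative example only and does not reproduce how REFMAC5 or the CCPEM GUI implement sharpening: it assumes a map on a grid with cubic voxels, applies an overall B value by scaling each Fourier coefficient by exp(−B|s|^{2}/4) with |s| = 1/d, and averages |F| in equal-width resolution shells. The function names and the synthetic Gaussian test map are hypothetical.

```python
import numpy as np

def _inv_resolution_sq(shape, voxel_size):
    """|s|^2 = 1/d^2 (in 1/A^2) on the FFT grid of a map with cubic voxels."""
    axes = [np.fft.fftfreq(n, d=voxel_size) for n in shape]
    sx, sy, sz = np.meshgrid(*axes, indexing="ij")
    return sx**2 + sy**2 + sz**2

def apply_b_factor(density, voxel_size, b):
    """Blur (b > 0) or sharpen (b < 0) by scaling F(s) with exp(-b |s|^2 / 4)."""
    s2 = _inv_resolution_sq(density.shape, voxel_size)
    coeffs = np.fft.fftn(density) * np.exp(-b * s2 / 4.0)
    return np.fft.ifftn(coeffs).real

def shell_mean_amplitude(density, voxel_size, n_shells=20):
    """Mean |F| in equal-width shells of |s| up to the Nyquist limit."""
    s = np.sqrt(_inv_resolution_sq(density.shape, voxel_size)).ravel()
    amp = np.abs(np.fft.fftn(density)).ravel()
    edges = np.linspace(0.0, 0.5 / voxel_size, n_shells + 1)
    idx = np.digitize(s, edges) - 1  # coefficients beyond Nyquist fall outside
    means = np.array([amp[idx == i].mean() for i in range(n_shells)])
    return 0.5 * (edges[:-1] + edges[1:]), means

# Toy data (not a real reconstruction): a Gaussian blob on a 32^3 grid, 1 A voxels.
grid = np.indices((32, 32, 32)) - 16
blob = np.exp(-np.sum(grid**2, axis=0) / (2 * 1.5**2))

s_mid, profile = shell_mean_amplitude(blob, 1.0, n_shells=8)
_, profile_sharp = shell_mean_amplitude(apply_b_factor(blob, 1.0, -200.0), 1.0, n_shells=8)

print("original decays with resolution:", profile[0] > profile[-1])
print("oversharpened rises with resolution:", profile_sharp[-1] > profile_sharp[0])
```

For a real map the shells would normally be restricted to the resolution range in which the signal is meaningful, and several candidate B values would be compared side by side.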
However, considering the behaviour of the average structure-factor amplitude only gives an overall picture, whereas different localized regions of the map may require different levels of blurring/sharpening. Consequently, owing to map heterogeneity it is advisable to work with multiple blurred and sharpened maps. Doing so can allow the accurate building of atomic models, accounting for overall shape and allowing the backbone to be traced, as well as finer structural details such as the position and orientation of side chains. Indeed, it is often useful to view multiple maps with differing levels of sharpening/blurring simultaneously in order to maximize visual interpretability; this strategy can help to gain more information than can be obtained by looking at a single map alone (Fig. 3).
An array of blurred/sharpened maps can be output by REFMAC5 and loaded into Coot automatically using the CCPEM GUI (Burnley et al., 2017). However, note that this is for gaining intuition by visual inspection of the maps, and that at present REFMAC5 will only use one map with one level of blurring/sharpening when performing the actual refinement. Indeed, care must then be taken to ensure that the appropriate level of sharpening/blurring is used for atomic model refinement.
Analysis of the distribution of B values in an atomic model can also help in deciding an appropriate level of blurring/sharpening to be applied to the map prior to refinement. Note that it does not make physical sense for a large number of atomic B values to cluster around a small value (or become negative). However, if the map has been excessively oversharpened then the atomic B values can become stuck, clustering around a minimum value; in such cases it may not be possible to recover the distribution of B values by reblurring. If all B values are high then this could indicate that the map has been overblurred, in which case it is possible to further sharpen the map.
However, it should be noted that the R factors change (decrease) as the blurring B value increases. This is systematic behaviour and does not necessarily imply a model of increased quality. Indeed, overall R factors depend on the overall B value (Brown et al., 2015). The FSC is much more stable under different levels of blurring, but there is still an effect. This means that it is not appropriate to compare such statistics (especially R factors) between models if the maps have been subjected to different levels of blurring.
5. Selected tools for atomic model refinement
One important similarity between MX and cryoEM is that the data derived using both techniques come from scattering experiments. In both cases there is typically high-resolution information loss. Lower resolution information of sufficient quality is typically obtained (for example regarding the overall shape and position of macromolecular domains) but the quality of the data degrades as the resolution increases, inhibiting the observation of finer structural details. Thus, for both techniques it is necessary to somehow account for the loss of high-resolution information that could not be observed sufficiently well during the experiment.
Many of the software tools that have been developed and established for MX refinement were designed to deal with this type of problem: refining atomic models in the presence of high-resolution information loss. Owing to this inherent similarity between the problems of MX and cryoEM model refinement, it has been possible to repurpose many of these software tools for use in the refinement of models derived using high-resolution cryoEM.
There has been debate as to whether refinement should be performed in real space or reciprocal space; both approaches have advantages and disadvantages. In real space, refinement can be performed locally, which has advantages for computational speed and parallelization. However, refinement in reciprocal space allows errors in Fourier coefficients to be accounted for more accurately. Although the errors in neighbouring Fourier coefficients are correlated, they are less correlated than those between proximal points in real space. Furthermore, the degree to which errors are correlated will depend on the nature of the underlying data (i.e. on the reconstruction methods). There is a common misconception that the original data are in real space. Reconstructions are typically performed in Fourier space using the projection-slice theorem, as for example in RELION (Scheres, 2012) and Frealign (Grigorieff, 2007), before the reconstructed maps are subsequently calculated in real space. Alternatively, three-dimensional reconstruction may be performed in real space using back-projection, noting that this results in a large correlation radius of errors in real space. These two procedures in real and reciprocal space are mathematically equivalent; refinement in both cases should be equivalent (apart from details of implementation). The community would benefit from clarification about best practice on this controversial topic; thus proper analysis will be required in the future. For discussion of the similarity of real-space and reciprocal-space refinement, details of the reciprocal-space target used in REFMAC5 for cryoEM refinement and the utility of half-maps for purposes of validation, see Brown et al. (2015) and Murshudov (2016).
5.1. Prior information
The prior probability distribution must minimally contain information about bond lengths and angles: basic chemical information, such as `ideal' bond lengths and angles, is usually employed universally. As the resolution decreases, longer and longer range information is needed to complement the data. The use of information about torsion angles, secondary structures, domains and intra-domain interactions might be required. B-value restraints are also used, as it is generally expected that neighbouring atoms will have similar B values in regions where modelled atoms are positioned sufficiently accurately (Nicholls et al., 2017).
Additional sources of prior knowledge relevant to macromolecules include structural information from reference models of known homologues, knowledge about secondary structures, hydrogen-bonding patterns etc. This information is encapsulated in the form of external restraints, which may be generated using software tools such as ProSMART (Nicholls et al., 2014) and LibG (Brown et al., 2015), and used during refinement by REFMAC5 (Nicholls et al., 2012) and Coot (Fig. 4). Such structural information has also been exploited in a similar way by other software packages (Headd et al., 2014, 2012; Schröder et al., 2010; Sheldrick, 2015; Smart et al., 2012). Additionally, prior information encapsulating local conformational conservation can be exploited, keeping local interatomic distances similar to those in the starting atomic model. Jelly-body restraints have proven to be particularly useful regularizers as they do not inject any information that was derived externally (for discussion, see Nicholls et al., 2017, 2013). They are often used to help modelled regions refine into a map in a concerted fashion (having a wide radius of convergence) as well as to ensure the stability of refinement during all stages of the process.
One of the problems of using long-range information as prior knowledge is the inherent dependence on the structural environments of the molecules. Consequently, special care must be exercised when using such information: well-known techniques such as robust estimator functions (Huber, 2011) are used in order to improve the application of long-range information derived from known structures (Fig. 5).
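To illustrate what a robust estimator function does to a distance restraint, the sketch below implements the classical Huber function. It is shown purely as a well-known example of the class; the text above does not state which robust function REFMAC5 applies to external restraints, and the function name, the default `sigma` and the tuning constant `k = 1.345` are assumptions for this example.

```python
import numpy as np

def huber_penalty(delta, sigma=1.0, k=1.345):
    """Huber function of the normalized residual z = delta / sigma.

    Quadratic for |z| <= k (like least squares) and linear beyond, so a
    restraint that is grossly inconsistent with the data pulls less
    strongly than it would under a pure least-squares penalty.
    """
    z = np.abs(np.asarray(delta, dtype=float) / sigma)
    return np.where(z <= k, 0.5 * z**2, k * (z - 0.5 * k))

# Hypothetical deviations (in A) between current and reference distances.
deviations = np.array([0.0, 0.1, 0.5, 2.0, 10.0])
least_squares = 0.5 * (deviations / 0.2) ** 2
robust = huber_penalty(deviations, sigma=0.2)

for d, ls, rb in zip(deviations, least_squares, robust):
    print(f"deviation {d:5.1f} A: least squares {ls:8.2f}, Huber {rb:7.2f}")
```

The two penalties agree for small deviations, whereas for large deviations the robust penalty grows only linearly, which limits the influence of restraints derived from a homologue in a different structural environment.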
5.2. Boxsize selection
Unlike in MX, in cryoEM there is no fixed unit cell. Boundaries are not enforced by the experiment, and thus they have to be chosen. The selection of an appropriate box size is important from a computational perspective: choosing a larger box size for a given resolution would result in the requirement for finer sampling in Fourier space in order to avoid a loss of map information owing to interpolation. In turn, using a finer sampling would dramatically slow Fourier-space-based optimization procedures.
It is typical for the map output from the three-dimensional reconstruction to have a box size that is larger than necessary for use in model refinement. Consequently, it is often necessary to reduce the box size prior to refinement. This is available as an option in the REFMAC5 section of the CCPEM interface, allowing the box size to be determined (reduced) automatically by creating a mask of a given radius around the model. By default, a 3 Å hard mask around the atomic coordinates is used at present.
5.3. Divide and Conquer
Attempting to fit and refine atomic models into cryoEM reconstructions corresponding to very large complexes can be a computational challenge. Complexes consisting of several hundreds of protein chains, with molecular weights of over 10 MDa, are now being encountered in practical application (see, for example, Zhang et al., 2017). Dealing with such cases can be a technical challenge, comprising many reconstructions of partially overlapping maps, extending to high resolution (in the range 3–4 Å), split across multiple files.
Such large complexes cannot be refined as a whole, owing to both computer memory limitations and the computational complexity associated with increasing map sizes. To refine such huge structures, we split the map and model into smaller more manageable portions, refine them separately and then put them back together at the end. Specifically, this procedure performs the following.

This approach, termed Divide and Conquer, is available on request and has already been successfully used for refinement (Zhang et al., 2017). Divide and Conquer will be available as an option in the CCPEM interface that is intended to expedite and parallelize the refinement of huge models containing hundreds of chains (Fig. 6).
5.4. Other relevant tools for model refinement in REFMAC5 and Coot
Following fold recognition and the building/placement of an initial atomic model, it is often the case that the model is located out of the density. In such cases, the model will need to be optimally positioned before detailed refinement can be performed. For this type of application, it is necessary to use a refinement technique with a sufficiently large radius of convergence. In simple cases this can be achieved using rigid-body refinement (available in various packages; see, for example, Afonine et al., 2009). To account for conformational differences between the initial starting model and the map, it is often more appropriate to use refinement with jelly-body restraints in REFMAC5 (Murshudov et al., 2011). These restraints keep the local structure of the molecule intact, whilst allowing groups of atoms (for example secondary structures or domains) to move in a concerted fashion. This helps to avoid local minima, increasing the radius of convergence of refinement (Nicholls et al., 2013), noting that other suites employ comparable or different approaches to address this type of problem (Schröder et al., 2007; Wang et al., 2016).
In cases where encountering local minima during refinement is unavoidable, the use of other algorithms or manual intervention may be required. Coot (Emsley et al., 2010) includes tools designed for this purpose that help to improve the local fit to density. Specifically, Jiggle Fit helps to appropriately position and orient the atomic model (rigid-body fitting), and Model Morphing (Terwilliger et al., 2013) allows localized regions to be fitted into the map by applying local shifts to the atomic model, whilst ensuring robustness so as to avoid geometric distortions (Brown et al., 2015). These tools are particularly useful when dealing with highly displaced regions of the model (for example macromolecular domains). Typically, such procedures should then be followed by full model refinement in REFMAC5 using jelly-body restraints in order to stabilize refinement. It should be noted that flexible molecular-dynamics-based fitting/refinement is available as an alternative approach to morphing (for example FlexEM; Joseph et al., 2016).
Following initial fitting of the model, atomic `real-space' refinement can be performed within Coot, allowing the fit of localized regions of the atomic model to be optimized, combined with manual intervention. In cryoEM maps, the signal-to-noise ratio is often such that additional restraints are needed to stabilize the model. Such restraints, for example those generated by ProSMART or LibG, can be imported into Coot for use during real-space refinement. These interatomic distance restraints can be displayed for purposes of visualization, providing feedback regarding the consistency between the restraints and the current state of the model (see Figs. 4 and 5).
Whilst real-space refinement in Coot is most typically used to refine individual residues or localized regions, it is sometimes desirable to refine larger regions (for example whole chains). This has recently become computationally tractable owing to parallelization of real-space refinement in Coot.
6. Rotational symmetry
Many protein structures are known to have rotational symmetry, with over 38% of the entries in the PDB having some form of rotational symmetry assigned. The symmetry information is frequently used in structure solution as well as to decrease the storage requirements by storing only the asymmetric portion of the structure and all symmetry operators required to generate the full structure. While the symmetry is usually known when the structure is being solved, there is a lack of a simple tool for rotational symmetry detection in either electron-density maps or atomic models.
6.1. Rotational symmetry detection using rotation function
The developed tool ProSHADE can take either an atomic model or a density map as an input; the atomic models are converted into a theoretical density map using the Clipper library (Cowtan, 2003) before subsequent processing. Density maps are then mapped onto a set of concentric spheres, and each sphere is decomposed using the spherical harmonics decomposition. The spherical harmonic coefficients are used to compute the rotation function integral over the radius (Navaza, 1994), which is then used to compute the inverse Fourier transform in the space of rotations SO(3); both the SO(3) transform and the spherical harmonics decomposition are computed using the SOFT library (Kostelec & Rockmore, 2007).
The inverse SO(3) Fourier transform space may be parameterized using Euler angles α, β and γ as indices, with the values being the cross-correlations between the structure and a rotated version of itself. The highest value is therefore obtained for angles α = β = γ = 0, but any structure with internal rotational symmetry about the origin will also have a peak representing each rotation which produces high cross-correlation between the original and rotated structures. As the cyclic symmetry (denoted C_{n}, where n is the order of rotational symmetry) is defined as a symmetry for which any rotation by 2π/n radians about the symmetry axis does not change the shape, it is clear that any such symmetry will have a signature set of peaks detectable in the inverse SOFT map.
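The peak signature of a cyclic symmetry can be illustrated in a much simpler setting than SO(3). The sketch below is a one-dimensional analogue only (rotation about a single known axis), not the ProSHADE or SOFT procedure: a toy angular density with n equally spaced bumps is correlated with rotated copies of itself, and the number of peaks close to the maximum gives the order n. The function names, the Gaussian bump shape and the 0.9 threshold are illustrative assumptions.

```python
import numpy as np


def ring_profile(order, n_samples=360, width=0.15):
    """Toy angular density: `order` identical Gaussian bumps equally spaced on a circle."""
    theta = np.linspace(0.0, 2.0 * np.pi, n_samples, endpoint=False)
    profile = np.zeros(n_samples)
    for k in range(order):
        # Wrap angular differences into [-pi, pi) so that each bump is periodic.
        delta = (theta - 2.0 * np.pi * k / order + np.pi) % (2.0 * np.pi) - np.pi
        profile += np.exp(-0.5 * (delta / width) ** 2)
    return profile


def self_rotation_correlation(profile):
    """Normalized correlation between the profile and every rotated copy of itself."""
    centred = profile - profile.mean()
    power = np.abs(np.fft.rfft(centred)) ** 2
    corr = np.fft.irfft(power, n=profile.size)  # circular autocorrelation via FFT
    return corr / corr[0]


def cyclic_order(profile, threshold=0.9):
    """Count local maxima of the self-correlation that lie above the threshold."""
    corr = self_rotation_correlation(profile)
    is_peak = (corr >= threshold) & (corr >= np.roll(corr, 1)) & (corr > np.roll(corr, -1))
    return int(is_peak.sum())


print(cyclic_order(ring_profile(5)))
```

The peak at zero rotation is counted together with the n − 1 peaks at multiples of 2π/n, so the count equals the order of the cyclic symmetry in this idealized, noise-free case.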
It then follows that by analysing the peaks in the inverse SOFT map, it should be possible to determine the position and order of rotational symmetry, thus detecting any C_{n} symmetry present in the structure. Once the C_{n} symmetries have been detected, it is further possible to determine the presence of any dihedral symmetries (D_{n}) owing to their property of consisting of two cyclic symmetries C_{n} and C_{2} with perpendicular axes of symmetry. Similarly, tetrahedral symmetry (T) has the characteristic property of having two C_{3} symmetries with an axis angle of cos^{−1}(1/3) ≃ 1.23 rad, while the icosahedral symmetry (I) can be detected by finding C_{5} and C_{3} symmetries with an angle between them of cos^{−1}(5^{1/2}/3) ≃ 0.73 rad.
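The axis-angle rules for the dihedral and tetrahedral cases can be sketched as follows. This is a minimal illustration assuming that a list of detected cyclic symmetries (order and axis) is already available; the function names, the input format and the angular tolerance are hypothetical, and the icosahedral rule is not included.

```python
import numpy as np


def axis_angle(a, b):
    """Acute angle (radians) between two rotation axes, treated as undirected lines."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cosine = abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(np.arccos(np.clip(cosine, 0.0, 1.0)))


def classify(cyclic, tol=0.05):
    """Apply the D and T rules to a list of (order, axis) cyclic symmetries."""
    found = set()
    for i, (n1, a1) in enumerate(cyclic):
        for n2, a2 in cyclic[i + 1:]:
            angle = axis_angle(a1, a2)
            # Dihedral: a C_n axis perpendicular to a C_2 axis.
            if 2 in (n1, n2) and abs(angle - np.pi / 2.0) < tol:
                found.add(f"D{max(n1, n2)}")
            # Tetrahedral: two C_3 axes separated by arccos(1/3).
            if n1 == n2 == 3 and abs(angle - np.arccos(1.0 / 3.0)) < tol:
                found.add("T")
    return found


print(classify([(4, [0, 0, 1]), (2, [1, 0, 0])]))
print(classify([(3, [1, 1, 1]), (3, [1, -1, -1])]))
```

The second call uses two body diagonals of a cube, which are threefold axes of a tetrahedron; the acute angle between them is cos^{−1}(1/3).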
The aforementioned rules for detecting the D, T and I symmetries are sufficient to find the appropriate symmetries of the structure. However, to find all of the symmetry operators the complete point groups need to be generated. Nonetheless, the two point-group elements listed in the aforementioned rules for each of the D, T and I symmetries are sufficient to generate the complete point groups; this follows from the fact that the two finite cyclic rotation generators are independent. Therefore, by using the inverse SOFT map-based approach, reliable detection and complete point-group element generation is possible. An example of symmetry detection using ProSHADE is shown in Fig. 7.
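Generating a complete point group from two rotation generators amounts to closing the set of rotation matrices under multiplication. The sketch below shows one simple way of doing this; it is not taken from ProSHADE, and the function names, tolerance and safety limit are assumptions. It is demonstrated for a dihedral and a tetrahedral case.

```python
import numpy as np


def rotation_matrix(axis, angle):
    """Rotation by `angle` (radians) about `axis`, using Rodrigues' formula."""
    axis = np.asarray(axis, dtype=float)
    x, y, z = axis / np.linalg.norm(axis)
    k = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * k + (1.0 - np.cos(angle)) * (k @ k)


def generate_point_group(generators, tol=1e-6, max_order=120):
    """Close a set of rotation generators under multiplication."""
    elements = [np.eye(3)]
    frontier = [np.eye(3)]
    while frontier:
        new = []
        for element in frontier:
            for generator in generators:
                candidate = generator @ element
                if not any(np.allclose(candidate, known, atol=tol) for known in elements):
                    elements.append(candidate)
                    new.append(candidate)
                    if len(elements) > max_order:
                        raise ValueError("generators do not close into a small finite group")
        frontier = new
    return elements


# D4: a fourfold axis along z and a perpendicular twofold axis along x.
d4 = generate_point_group([rotation_matrix([0, 0, 1], np.pi / 2.0),
                           rotation_matrix([1, 0, 0], np.pi)])
# T: two threefold axes along cube body diagonals.
t = generate_point_group([rotation_matrix([1, 1, 1], 2.0 * np.pi / 3.0),
                          rotation_matrix([1, -1, -1], 2.0 * np.pi / 3.0)])
print(len(d4), len(t))
```

Each resulting matrix is one symmetry operator; applying all of them to the coordinates of the asymmetric portion of a structure would regenerate the full assembly.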
7. Discussion
In this contribution, we describe several tools available from CCPEM (Wood et al., 2015) and CCP4 (Winn et al., 2011). We anticipate that the Divide and Conquer algorithm will become useful in facilitating the refinement of large molecules with potentially multiple maps corresponding to multiple focused reconstructions. We emphasize the importance of selecting appropriate levels of map blurring/sharpening, which may be facilitated by considering the behaviour of the average map amplitude at different resolutions, and the utility of simultaneously viewing multiple blurred/sharpened maps. These tools are available from within the CCPEM interface (Burnley et al., 2017).
Model building using cryoEM maps poses special problems, and is often the most time-consuming part of the cryoEM data-interpretation process. Several of the techniques available in Coot (Emsley et al., 2010) have successfully been used by structural biologists (see, for example, Casañal et al., 2017) for this purpose, notably Jiggle Fit for positioning and orienting the atomic model, Model Morphing to allow localized regions to be fitted into the map and the use of externally derived restraints that can be visualized as well as applied during refinement to aid stability and/or improve geometry. Recent efforts towards parallelization have resulted in the ability to refine larger regions of the model concurrently.
We present a tool for symmetry identification from a given map or coordinate set: ProSHADE. Whilst it is likely that map symmetry is accounted for during three-dimensional reconstruction, this information is often lost. Ideally, this information should be carried from reconstruction to the deposition of maps and atomic models. ProSHADE can identify the symmetry of a map, and thus may be used during deposition as well as during molecular visualization. Further details will be provided elsewhere. ProSHADE is available upon request, with the intention of distributing it via CCP4 and CCPEM in the future.
We also discuss the importance of phases, and the potential utility of replacing poor-quality observations with their expectations. Specifically, with random amplitudes but exact phases, the correlation between the current and `true' maps is ∼78.5%. In contrast, when phases are random the FSC will be zero, irrespective of the accuracy of the amplitudes. We thus infer that phases are much more important than amplitudes (a fact that has been known for a long time). Furthermore, if we replace random amplitudes with their expectations then the FSC increases to ∼88.6%. Thus, if the structure factors have little or no information about the structure under analysis, then there may be utility in replacing the observations with their expected values.
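These figures can be checked numerically under a simple model. The sketch below assumes that the Fourier coefficients in a shell follow an acentric Wilson-type distribution (independent complex Gaussian values), in which case the two correlations are expected to approach π/4 ≃ 0.785 and π^{1/2}/2 ≃ 0.886; this distributional assumption, the sample size and all variable names are ours and are used for illustration only.

```python
import numpy as np


def fsc(a, b):
    """Correlation between two sets of complex Fourier coefficients in one shell."""
    numerator = np.real(np.vdot(a, b))
    return float(numerator / np.sqrt(np.vdot(a, a).real * np.vdot(b, b).real))


rng = np.random.default_rng(0)
size = 200_000

# 'True' coefficients: complex Gaussian, so amplitudes are Rayleigh distributed.
f_true = rng.normal(size=size) + 1j * rng.normal(size=size)
phasor = f_true / np.abs(f_true)  # exact phases as unit complex numbers

# Exact phases combined with independent random amplitudes from the same distribution.
random_amplitude = np.abs(rng.normal(size=size) + 1j * rng.normal(size=size))
fsc_random_amplitude = fsc(f_true, random_amplitude * phasor)

# Exact phases combined with the expected amplitude (a constant within the shell).
fsc_expected_amplitude = fsc(f_true, np.sqrt(np.pi / 2.0) * phasor)

# Exact amplitudes combined with random phases.
random_phasor = np.exp(2j * np.pi * rng.random(size))
fsc_random_phase = fsc(f_true, np.abs(f_true) * random_phasor)

print(f"random amplitudes, exact phases:   {fsc_random_amplitude:.3f}")
print(f"expected amplitudes, exact phases: {fsc_expected_amplitude:.3f}")
print(f"exact amplitudes, random phases:   {fsc_random_phase:.3f}")
```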
More pragmatically, we demonstrate that there is an upper limit of [2FSC_{1/2}/(1 + FSC_{1/2})]^{1/2} on the correlation between observed and calculated Fourier coefficients, expressed in terms of the FSC between two half-data sets. Should correlations be observed above this limit, further investigation would be warranted.
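As a small worked example, the limit can be evaluated directly from a half-map FSC value. The function name below is hypothetical, and rejecting values outside the range 0 to 1 is a choice made for this sketch rather than part of the derivation.

```python
import math


def max_model_map_correlation(fsc_half):
    """Upper limit [2*FSC_half / (1 + FSC_half)]**0.5 for a given half-map FSC."""
    if not 0.0 <= fsc_half <= 1.0:
        raise ValueError("fsc_half is expected to lie between 0 and 1")
    return math.sqrt(2.0 * fsc_half / (1.0 + fsc_half))


for value in (0.143, 0.5, 0.9):
    print(f"FSC_half = {value:.3f} -> limit = {max_model_map_correlation(value):.3f}")
```

For instance, a half-map FSC of 0.143 corresponds to a limit of approximately 0.5, so a model-to-map correlation well above 0.5 in that resolution shell would merit a closer look.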
7.1. Future perspectives
Recent advances in cryoEM have resulted in this method rapidly becoming the method of choice for structural biologists, especially for those studying the three-dimensional structures of very large macromolecular complexes. For the last 50 years or so, macromolecular crystallography, especially that using X-ray scattering, has been the main technique for structure elucidation. Consequently, there is a wealth of accumulated experience and knowledge of this technique. It is tempting to reuse tools developed for X-ray crystallography for cryoEM data analysis and modelling. Although some of these tools could well be transferred between these techniques legitimately, there are some significant differences that should be accounted for. Since both techniques are used to solve the structures of macromolecules, tools encapsulating prior knowledge about macromolecules can easily be transferred. These include the generation and use of restraints describing constituent blocks of macromolecules (Long et al., 2017) and the transfer of local conformational information between homologous macromolecular structures (Nicholls et al., 2012; Kovalevskiy et al., 2018). Moreover, since both are the result of particle scattering, most of the Fourier-based techniques can be used for analysing both types of experimental data.
However, there are significant differences between these experimental techniques and these need to be accounted for if the objective is to derive the `best' atomic model using noisy observations.
There are many other problems that require further attention, including (i) the refinement of very large molecules against very large maps, (ii) the validation of derived atomic models against observed data, (iii) the full automation of model building using similar and/or invariant substructures, (iv) optimal difference map calculation between observed maps and between observed and calculated maps, accounting for all sources of error along with the potential correlations between them, and (v) accurate atomic form factors for electron scattering.
APPENDIX A
Relationships between different Fourier shell correlations
Relationships between Fourier shell correlations (FSCs) calculated in different scenarios have been described elsewhere (see, for example, Rosenthal & Henderson, 2003; Karplus & Diederichs, 2012), including the FSC between two half-data reconstructed images, the FSC between two full-data reconstructions and the FSC between observed and true Fourier coefficients. To make statements and calculations that are more precise, we will use the following assumptions: (i) there is no correlation between signal and noise, (ii) there is no correlation between the noise in the two half-data reconstructions and (iii) the noise in the Fourier shells represents the noise in the data, i.e. the distribution of the noise for all Fourier coefficients in sufficiently narrow Fourier shells is the same as the distribution of the noise for any single Fourier coefficient taken from this shell. We will use the following notation:
F_{1} = F_{T} + n_{1}, F_{2} = F_{T} + n_{2}, F = (F_{1} + F_{2})/2 = F_{T} + n, n = (n_{1} + n_{2})/2,
where the subscripts 1 and 2 indicate the first and second halves of the data, F denotes Fourier coefficients, subscript T is for Fourier coefficients from `true' images and n denotes the noise. We also assume that the noise has zero mean, and that var(n_{1}) = var(n_{2}). As a consequence, var(n) = var(n_{1})/2. Under the assumptions stated above,
FSC_{1/2} = cor(F_{1}, F_{2}) = cov(F_{1}, F_{2})/[var(F_{1})var(F_{2})]^{1/2} = var(F_{T})/[var(F_{T}) + var(n_{1})].
This can be expressed in terms of the ratio of the variance of the noise to the variance of the signal,
FSC_{1/2} = 1/[1 + var(n_{1})/var(F_{T})].
Now let us consider the Fourier shell correlation between two full hypothetical data sets with Fourier coefficients F_{1full} and F_{2full}. In this case
cor(F_{1full}, F_{2full}) = var(F_{T})/[var(F_{T}) + var(n)] = var(F_{T})/[var(F_{T}) + var(n_{1})/2] = 2FSC_{1/2}/(1 + FSC_{1/2}).
Finally, we calculate the correlation between a full data reconstructed map and the `true' map,
cor(F, F_{T}) = var(F_{T})/{var(F_{T})[var(F_{T}) + var(n)]}^{1/2} = {var(F_{T})/[var(F_{T}) + var(n)]}^{1/2} = [2FSC_{1/2}/(1 + FSC_{1/2})]^{1/2}.
To clarify, the FSC_{1/2} used in this paper is the Fourier shell correlation between half-data reconstructions cor(F_{1}, F_{2}).
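The relationships in this appendix can be checked with a short simulation. The sketch below assumes complex Gaussian signal and noise satisfying assumptions (i)-(iii); the variances, the sample size and the helper names are arbitrary illustrative choices.

```python
import numpy as np


def complex_normal(rng, size, variance):
    """Zero-mean complex Gaussian values with total variance E|z|^2 = variance."""
    return np.sqrt(variance / 2.0) * (rng.normal(size=size) + 1j * rng.normal(size=size))


def cor(a, b):
    """Real part of the complex correlation between two sets of coefficients."""
    a = a - a.mean()
    b = b - b.mean()
    return float(np.real(np.vdot(a, b)) / np.sqrt(np.vdot(a, a).real * np.vdot(b, b).real))


rng = np.random.default_rng(0)
size, var_signal, var_noise_half = 200_000, 1.0, 3.0

f_true = complex_normal(rng, size, var_signal)
# Four independent half-data sets, so that two independent full data sets can be formed.
halves = [f_true + complex_normal(rng, size, var_noise_half) for _ in range(4)]
f_full_a = 0.5 * (halves[0] + halves[1])
f_full_b = 0.5 * (halves[2] + halves[3])

fsc_half = cor(halves[0], halves[1])   # expected var_signal / (var_signal + var_noise_half)
fsc_full = cor(f_full_a, f_full_b)     # expected 2 * fsc_half / (1 + fsc_half)
fsc_true = cor(f_full_a, f_true)       # expected the square root of the above

print(f"half-data FSC:       {fsc_half:.3f}")
print(f"full-data FSC:       {fsc_full:.3f} (predicted {2 * fsc_half / (1 + fsc_half):.3f})")
print(f"FSC against `true':  {fsc_true:.3f} (predicted {np.sqrt(2 * fsc_half / (1 + fsc_half)):.3f})")
```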
Acknowledgements
The authors would like to thank Tom Burnley for facilitating the distribution of software tools via CCPEM and the implementation of features in the CCPEM GUI, Paul Emsley for discussion and implementation of Coot functionalities, Jude Short for the examples of the use of external prior information involving RAD51, and Jake Grimmett and Toby Darling from the MRC-LMB Scientific Computing Department for computing support and resources. We would also like to thank our colleagues from the MRC-LMB for interesting problems, discussions and feedback, which continually inspire development.
Funding information
This work was supported by the Medical Research Council, by CCP4/STFC grant No. PR140014 (RAN), by BBSRC grant No. BB/L007010/1 (OK) and by MRC grant No. MC_US_A025_1012 (GNM and MT).
References
Afonine, P. V., Grosse-Kunstleve, R. W., Urzhumtsev, A. & Adams, P. D. (2009). J. Appl. Cryst. 42, 607–615.
Afonine, P. V., Headd, J. J., Terwilliger, T. C. & Adams, P. D. (2013). Comput. Crystallogr. Newsl. 4, 43–44. https://www.phenix-online.org/newsletter/CCN_2013_07.pdf.
Afonine, P. V., Moriarty, N. W., Mustyakimov, M., Sobolev, O. V., Terwilliger, T. C., Turk, D., Urzhumtsev, A. & Adams, P. D. (2015). Acta Cryst. D71, 646–666.
Afonine, P. V., Poon, B. K., Read, R. J., Sobolev, O. V., Terwilliger, T. C., Urzhumtsev, A. & Adams, P. D. (2018). Acta Cryst. D74, 531–544.
Bartesaghi, A., Merk, A., Banerjee, S., Matthies, D., Wu, X., Milne, J. L. & Subramaniam, S. (2015). Science, 348, 1147–1151.
Berman, H. M. et al. (2002). Acta Cryst. D58, 899–907.
Brown, A., Long, F., Nicholls, R. A., Toots, J., Emsley, P. & Murshudov, G. (2015). Acta Cryst. D71, 136–153.
Brunger, A. T., DeLaBarre, B., Davies, J. M. & Weis, W. I. (2009). Acta Cryst. D65, 128–133.
Burnley, T., Palmer, C. M. & Winn, M. (2017). Acta Cryst. D73, 469–477.
Casañal, A., Kumar, A., Hill, C. H., Easter, A. D., Emsley, P., Degliesposti, G., Gordiyenko, Y., Santhanam, B., Wolf, J., Wiederhold, K. & Dornan, G. L. (2017). Science, 1056–1059.
Cowtan, K. (2003). IUCr Comput. Commun. Newsl. 2, 4–9. https://www.iucr.org/resources/commissions/crystallographic-computing/newsletters/2.
Emsley, P., Lohkamp, B., Scott, W. G. & Cowtan, K. (2010). Acta Cryst. D66, 486–501.
Faruqi, A. R. & McMullan, G. (2011). Q. Rev. Biophys. 44, 357–390.
Gong, Z., Schwieters, C. D. & Tang, C. (2015). PLoS One, 10, e0120445.
Grigorieff, N. (2007). J. Struct. Biol. 157, 117–125.
Headd, J. J., Echols, N., Afonine, P. V., Grosse-Kunstleve, R. W., Chen, V. B., Moriarty, N. W., Richardson, D. C., Richardson, J. S. & Adams, P. D. (2012). Acta Cryst. D68, 381–390.
Headd, J. J., Echols, N., Afonine, P. V., Moriarty, N. W., Gildea, R. J. & Adams, P. D. (2014). Acta Cryst. D70, 1346–1356.
Heel, M. van & Schatz, M. (2017). bioRxiv, 224402. https://doi.org/10.1101/224402.
Huber, P. J. (2011). International Encyclopedia of Statistical Science, edited by M. Lovric, pp. 1248–1251. Berlin, Heidelberg: Springer.
Jakobi, A. J., Wilmanns, M. & Sachse, C. (2017). eLife, 6, e27131.
Joseph, A. P., Malhotra, S., Burnley, T., Wood, C., Clare, D. K., Winn, M. & Topf, M. (2016). Methods, 100, 42–49.
Karplus, P. A. & Diederichs, K. (2012). Science, 336, 1030–1033.
Kirkland, E. J. (2010). Advanced Computing in Electron Microscopy. New York: Springer.
Kirmizialtin, S., Loerke, J., Behrmann, E., Spahn, C. M. & Sanbonmatsu, K. Y. (2015). Methods Enzymol. 558, 497–514.
Kostelec, P. J. & Rockmore, D. N. (2007). SOFT: SO(3) Fourier Transforms. https://www.cs.dartmouth.edu/~geelong/soft/.
Kovalevskiy, O., Nicholls, R. A., Long, F., Carlon, A. & Murshudov, G. N. (2018). Acta Cryst. D74, 215–227.
Kucukelbir, A., Sigworth, F. J. & Tagare, H. D. (2014). Nature Methods, 11, 63–65.
Kühlbrandt, W. (2014). eLife, 3, e03678.
Long, F., Nicholls, R. A., Emsley, P., Gražulis, S., Merkys, A., Vaitkus, A. & Murshudov, G. N. (2017). Acta Cryst. D73, 112–122.
Lyumkis, D., Brilot, A. F., Theobald, D. L. & Grigorieff, N. (2013). J. Struct. Biol. 183, 377–388.
McCoy, A. J., Grosse-Kunstleve, R. W., Adams, P. D., Winn, M. D., Storoni, L. C. & Read, R. J. (2007). J. Appl. Cryst. 40, 658–674.
Murshudov, G. N. (2016). Methods Enzymol. 579, 277–305.
Murshudov, G. N., Skubák, P., Lebedev, A. A., Pannu, N. S., Steiner, R. A., Nicholls, R. A., Winn, M. D., Long, F. & Vagin, A. A. (2011). Acta Cryst. D67, 355–367.
Navaza, J. (1994). Acta Cryst. A50, 157–163.
Nicholls, R. A., Fischer, M., McNicholas, S. & Murshudov, G. N. (2014). Acta Cryst. D70, 2487–2499.
Nicholls, R. A., Kovalevskiy, O. & Murshudov, G. N. (2017). Methods Mol. Biol. 1607, 565–593.
Nicholls, R. A., Long, F. & Murshudov, G. N. (2012). Acta Cryst. D68, 404–417.
Nicholls, R. A., Long, F. & Murshudov, G. N. (2013). Advancing Methods for Biomolecular Crystallography, edited by R. Read, A. Urzhumtsev & V. Y. Lunin, pp. 231–258. Dordrecht: Springer.
Pellegrini, L., Yu, D. S., Lo, T., Anand, S., Lee, M., Blundell, T. L. & Venkitaraman, A. R. (2002). Nature (London), 420, 287–293.
Peng, L.-M., Ren, G., Dudarev, S. L. & Whelan, M. J. (1996). Acta Cryst. A52, 257–276.
Pettersen, E. F., Goddard, T. D., Huang, C. C., Couch, G. S., Greenblatt, D. M., Meng, E. C. & Ferrin, T. E. (2004). J. Comput. Chem. 25, 1605–1612.
Rawi, R., Whitmore, L. & Topf, M. (2010). Bioinformatics, 26, 1673–1674.
Roseman, A. M. (2000). Acta Cryst. D56, 1332–1340.
Rosenthal, P. B. & Henderson, R. (2003). J. Mol. Biol. 333, 721–745.
Scheres, S. H. W. (2012). J. Struct. Biol. 180, 519–530.
Scheres, S. H. W. (2014). eLife, 3, e03665.
Schröder, G. F., Brunger, A. T. & Levitt, M. (2007). Structure, 15, 1630–1641.
Schröder, G. F., Levitt, M. & Brunger, A. T. (2010). Nature (London), 464, 1218–1222.
Settembre, E. C., Chen, J. Z., Dormitzer, P. R., Grigorieff, N. & Harrison, S. C. (2011). EMBO J. 30, 408–416.
Sheldrick, G. M. (2015). Acta Cryst. C71, 3–8.
Short, J. M., Liu, Y., Chen, S., Soni, N., Madhusudhan, M. S., Shivji, M. K. & Venkitaraman, A. R. (2016). Nucleic Acids Res. 44, 9017–9030.
Skubák, P. & Pannu, N. S. (2013). Nature Commun. 4, 2777.
Smart, O. S., Womack, T. O., Flensburg, C., Keller, P., Paciorek, W., Sharff, A., Vonrhein, C. & Bricogne, G. (2012). Acta Cryst. D68, 368–380.
Sun, L., Zhang, X., Gao, S., Rao, P. A., Padilla-Sanchez, V., Chen, Z., Sun, S., Xiang, Y., Subramaniam, S., Rao, V. B. & Rossmann, M. G. (2015). Nature Commun. 6, 7548.
Terwilliger, T. C., Read, R. J., Adams, P. D., Brunger, A. T., Afonine, P. V. & Hung, L.-W. (2013). Acta Cryst. D69, 2244–2250.
Terwilliger, T. C., Sobolev, O. V., Afonine, P. V. & Adams, P. D. (2018). Acta Cryst. D74, 545–559.
Trabuco, L. G., Villa, E., Mitra, K., Frank, J. & Schulten, K. (2008). Structure, 16, 673–683.
Trabuco, L. G., Villa, E., Schreiner, E., Harrison, C. B. & Schulten, K. (2009). Methods, 49, 174–180.
Vagin, A. & Teplyakov, A. (2010). Acta Cryst. D66, 22–25.
Wang, R. Y.-R., Song, Y., Barad, B. A., Cheng, Y., Fraser, J. S. & DiMaio, F. (2016). eLife, 5, e17219.
Winn, M. D. et al. (2011). Acta Cryst. D67, 235–242.
Wlodawer, A. & Dauter, Z. (2017). Acta Cryst. D73, 379–380.
Wlodawer, A., Li, M. & Dauter, Z. (2017). Structure, 25, 1589–1597.
Wood, C., Burnley, T., Patwardhan, A., Scheres, S., Topf, M., Roseman, A. & Winn, M. (2015). Acta Cryst. D71, 123–126.
Zhang, F., Chen, Y., Ren, F., Wang, X., Liu, Z. & Wan, X. (2017). IEEE/ACM Trans. Comput. Biol. Bioinform. 14, 316–325.
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.