research papers
The befores and afters of molecular replacement
York Structural Biology Laboratory, Chemistry Department, University of York, York YO10 5DD, England
*Correspondence e-mail: e.dodson@ysbl.york.ac.uk
This review addresses the essential questions to consider when attempting to phase a new structure using molecular replacement. Sequence matching can suggest whether there is a suitable three-dimensional model available, but it is also important to analyse the model in order to find its likely oligomeric state and to establish whether there are likely to be domain movements. Once a solution has been found it must be refined, which can be challenging for low-homology models. There is a detailed discussion of structures used as examples for CCP4 tutorials.
Keywords: bioinformatics; molecular replacement; validation.
1. Introduction
Most of the readers of this volume are probably structural biologists, with a bias towards biology rather than structure. The discipline of crystallography is now fairly mature and can provide semi-automated tools to determine a structure without requiring a detailed understanding of the technical procedures. Users want to understand how a particular macromolecule fits into the machinery of a living cell and knowledge of its three-dimensional geometry can illuminate this.
However, to obtain such a model we need firstly to understand the known biochemistry, secondly to obtain protein, grow a crystal and collect observable intensities, and thirdly either to determine some experimental phases to allow the first model to be built or to use molecular-replacement (MR) techniques to position a known model in the new cell and thus generate initial phases. The final stage is to refine this model to one most consistent with the observed data.
1.1. Tutorials
The examples I will discuss are used for molecular-replacement tutorial material available from CCP4.
Alexei Vagin and Andrey Lebedev have prepared a tutorial which is available as part of the MOLREP download from https://www.ysbl.york.ac.uk/~alexei/molrep.html#installation.
Martyn Winn and I prepared extra material for a workshop in China. It is available at https://www.ccp4.ac.uk/courses/china06/tutorials/mr_tutorial_first.html and https://www.ccp4.ac.uk/courses/china06/tutorials/mr_tutorial_advanced.html.
2. The known biochemistry
It is safe to assume that all structural projects begin with knowledge of the sequence of the molecule under study and hence its molecular weight. The first step in determining a structure is to search the available databases to see what is already known; without an existing three-dimensional model, molecular replacement is not an option for structure solution and the experimental design will require the measurement of extra sets of observed intensities for the determination of phases.
There is now a wealth of sequence information available for many organisms and excellent bioinformatics tools for searching these sequence databases. Likely homologous models can be gleaned by matching the sequence to those of known structures. There are many web-based tools for this, several of which are described by other contributors. I will illustrate the approach using the MSDtarget tool from the European Bioinformatics Institute (EBI) Macromolecular Structure Division (MSD; https://www.ebi.ac.uk/msd). This returns a list of targets and a pairwise alignment for all likely models. More sophisticated multiple sequence-alignment tools are described in Barton (2008) and Schwarzenbacher et al. (2008).
There may well be several models available and it is possible to learn more about your system by analysing and comparing these. Again there are many tools available, but I will use the EBI MSD tool MSDfold, developed by Eugene Krissinel (Krissinel & Henrick, 2004). This matches the secondary-structure elements of all models to the selected target and aligns them. (It may well also find other examples with similar folds but lower sequence identity not found by MSDtarget.) When there is low sequence homology, the secondary-structure matching may give a slightly different sequence alignment to that based on sequence alone.
It is useful to inspect the overlap of these aligned models. This can reveal domain movement between one model and another; unless this is treated properly, it can make it very difficult to obtain any MR solution. The aligned domains can be used as input for the existing MR programs that accept multiple overlapping copies.
It is also important to follow up clues to the likely biological entity, e.g. does this protein form an oligomer? The EBI tool MSDpisa analyses this and returns a set of coordinates for the assembly, as well as reporting the buried surface area, hydrogen bonds and so on (Krissinel & Henrick, 2007).
2.1. Examples
2.1.1. Human S100 A12 (S100)
This structure has been deposited with PDB code 1e8a (Moroz et al., 2001). Some of the MSDtarget output obtained from the S100 sequence search is given in Fig. 1(a); Fig. 1(b) shows the pairwise matches. I will discuss models 1irj (41% sequence identity) and 1mho (39% sequence identity).
MSDpisa indicates that both models are likely to be dimers with buried surface areas of 1282 and 1321 Å², respectively. Figs. 1(c) and 1(d) show the alignment of these dimers. It is clear that their dimer interfaces are slightly different. A post mortem comparison of these models with the S100 structure shows that the root-mean-square (r.m.s.) difference in Cα positions of the monomers is 0.88 Å (1irj) and 1.23 Å (1mho), whilst for the 1irj dimer it is 2.64 Å and for the 1mho dimer it is 1.68 Å. If searching with a monomer, 1irj would prove the better model, but if searching with a dimer the 1mho example is better. In practice, it is sensible to try all available models in all likely oligomeric states.
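The comparison above can be sketched in a few lines. This is only an illustration, assuming the Cα coordinates have already been superposed and paired atom-for-atom by an alignment tool; the function names `rmsd` and `rank_models` are hypothetical and are not part of any CCP4 or EBI program. The r.m.s. values in the dictionaries are those quoted in the text.

```python
import math


def rmsd(coords_a, coords_b):
    """R.m.s. distance between two paired lists of (x, y, z) positions.

    Assumes the two sets are already superposed and matched atom-for-atom.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


def rank_models(rmsd_by_model):
    """Return model names ordered from smallest to largest r.m.s. difference."""
    return sorted(rmsd_by_model, key=rmsd_by_model.get)


# R.m.s. Calpha differences (in Angstrom) quoted in the text for S100.
monomer_rmsd = {"1irj": 0.88, "1mho": 1.23}
dimer_rmsd = {"1irj": 2.64, "1mho": 1.68}

best_monomer = rank_models(monomer_rmsd)[0]
best_dimer = rank_models(dimer_rmsd)[0]
```

Ranking the monomer and dimer searches separately makes the point of the example explicit: the preferred search model changes with the assumed oligomeric state. Of course, such a comparison is only possible after the event, which is why all available models should be tried.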
2.1.2. Sugar phosphatase in the closed form
This structure has been deposited with PDB code 1tj3 (Fieulaine et al., 2005). There are several models with 100% sequence identity. However, there is clearly a hinged domain movement; the r.m.s. distance for Cα atoms between two of these models, 2d2v and 1s2o, is 2.4 Å. After overlapping the models (Figs. 2a and 2b), it is clear that there are two domains, one made up of residues 1–73 and 163–244, and another consisting of residues 74–162. In such a case it is necessary to search for a solution with each domain separately.
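Preparing the two search models amounts to splitting the coordinate file by residue number. The following is a minimal sketch, assuming a standard fixed-column PDB file in which the residue sequence number occupies columns 23–26; the names `DOMAIN_RANGES`, `domain_of` and `split_pdb_lines` are hypothetical, and the residue ranges are those given in the text.

```python
# Residue ranges for the two domains, as given in the text.
DOMAIN_RANGES = {
    "domain_1": [(1, 73), (163, 244)],
    "domain_2": [(74, 162)],
}


def domain_of(residue_number, ranges=DOMAIN_RANGES):
    """Return the name of the domain containing this residue, or None."""
    for name, spans in ranges.items():
        if any(first <= residue_number <= last for first, last in spans):
            return name
    return None


def split_pdb_lines(pdb_lines, ranges=DOMAIN_RANGES):
    """Group ATOM/HETATM records by domain using the residue number field.

    Assumes fixed-column PDB records (residue number in columns 23-26).
    Records outside every range, and all non-coordinate records, are skipped.
    """
    domains = {name: [] for name in ranges}
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        residue_number = int(line[22:26])
        name = domain_of(residue_number, ranges)
        if name is not None:
            domains[name].append(line)
    return domains
```

Each list of records can then be written to its own file and used as a separate search model.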
2.1.3. Insulin
At high concentration and in the presence of a metal, insulin exists as a hexamer made up of three dimers each with two chemically identical monomers (Fig. 3). There are many crystal structures of insulin hexamers, some with one or more hexamers in the crystal and some containing monomers or dimers, with the hexamer generated by crystal symmetry (Baker et al., 1988). An analysis of the contents of the asymmetric unit may suggest the likely stoichiometry and a self-rotation function may suggest the nature of the noncrystallographic symmetry. However, the interactions between crystallographic and noncrystallographic symmetry can become very complex.
2.1.4. Family 2 carbohydrate esterase (CE2)
This is a 345-residue protein solved by MR from a low-homology model. Some experimental phase information was also obtained from the anomalous scattering power of two Se atoms. The MR solution was verified by checking it against the known selenium positions.
2.1.5. hypF
This structure is of the prokaryotic hydrogenase maturation factor hypF acylphosphatase-like domain with a bound anion (Rosano et al., 2002). It was solved from experimental phases using a Hg derivative (the images were used for the data-processing tutorial described in https://www.mrc-lmb.cam.ac.uk/harry/imosflm/tutorial.html). It was later refined against 1.3 Å data and deposited with PDB code 1gxu.
It can also be solved straightforwardly by MR using the model 1w2i with 38% sequence identity. I have included it to illustrate how the phase refinement carried out using the program ACORN (Yao et al., 2005) can improve the map and reduce the bias towards the initial model.
3. Planning the crystallography
While studying the bioinformatics information based on sequence, one hopes that a large crystal of the protein of interest is growing. The type of diffraction measurements required to solve the X-ray structure will depend to some extent on the chosen solution method. For experimental phasing, it is necessary to have a detectable substructure incorporated into the crystal, either anomalous scatterers or heavy atoms. Accurate measurements of the differences arising from that substructure to a limited resolution are needed to first position the substructure and then estimate experimental phases. For phase extension and refinement we need the highest observable resolution plus complete low-resolution data. To solve the structure by MR, a single complete data set to modest resolution is enough, but again the MR solution model must be refined and this is much more straightforward with higher resolution data.
If possible, it helps to use both the MR solution model and experimental phase information during the refinement step. These phases will not be biased towards the initial model and so can help when rebuilding and act as additional restraints to speed up refinement (Pannu et al., 1998).
As an aside, it is important to remember that when combining information from two (or more) diffraction experiments it is essential that the data sets are indexed according to the same convention and that the MR model and the substructure are positioned relative to the same origin. There is discussion of these conventions in the CCP4 program documentation. See https://www.ccp4.ac.uk/dist/html/reindexing.html and https://www.ccp4.ac.uk/dist/html/alternate_origins.html.
A simple way to achieve this is to calculate phases from the MR model and use these to produce anomalous or isomorphous difference maps with the data to be used for estimating experimental phases. If more than one set of phases is already available, then the Clipper utility Phase Comparison (Cowtan, 2003) checks consistency and makes the appropriate corrections for any required origin shift or change of hand.
3.1. Assessing the quality of diffraction data
The diffraction experiment will reveal the unit-cell parameters and symmetry of our new crystal form. As for any X-ray study, it is important to assess the quality of the experimental data. It should be complete at low resolution and extend to the highest resolution available to help the refinement procedures. The data-reduction software gives some analysis of other problems which may arise. Is the crystal twinned? Is the diffraction very anisotropic? Fig. 4 shows various plots taken from the output of the TRUNCATE program which may indicate problems. There is a discussion of indicators of data quality at https://www.ccp4.ac.uk/dist/html/pxmaths/bmg10.html and of twinning at https://www.ccp4.ac.uk/dist/html/twinning.html. The program SFCHECK (Vaguine et al., 1999) is another tool for data analysis. As well as detecting anisotropy and possible twinning, it reports noncrystallographic translation.
3.2. Determining the space group
It is often not possible to assign a space group unambiguously at this stage. Absences along particular axes indicate screw axes, e.g. P2₁ will have absences for all 0k0 reflections where k is odd. However, any pseudo-translation vector (x, 0.5, z) will also cause the same reflections to have very weak intensities. There are also pairs of space groups that give rise to the same systematic absences; examples are space groups P4₁ and P4₃ or P6₁ and P6₅. The MR search should settle this uncertainty since one of the possible space groups should score significantly higher than any of the alternatives.
3.3. What can we estimate from sequence and diffraction?
From the volume of the crystal asymmetric unit and the molecular weight of the protein, it is possible to estimate how many independent copies of the molecule under investigation are likely to be in the asymmetric unit. If there is more than one it is important to check whether there is a noncrystallographic symmetry element or a noncrystallographic translation vector relating them. Both these can be predicted from the X-ray data alone. If there is extra symmetry such as a noncrystallographic twofold axis, the self-rotation function may reveal it (Figs. 5a and 5b). However, this can be masked by crystal symmetry and be very confusing to interpret! Insulin studies illustrate this: the intersecting twofold and threefold axes of the hexamer are sometimes crystallographic and sometimes not and the asymmetric unit can consist of monomers, dimers, trimers or hexamers. There are examples of structures in many different space groups, e.g. H32 and P6₃22 with a monomer in the asymmetric unit, H3 and P2₁3 with a dimer, P321 with three molecules in the asymmetric unit, a trimer on the twofold axis and a monomer at the 32 centre, and P2₁ with the whole hexamer in the asymmetric unit.
If there is a noncrystallographic translation the 4 Å native Patterson will have a large off-origin peak at the position representing this translation. Unlike noncrystallographic rotations, noncrystallographic translations are not particularly useful. In fact, they introduce awkward structure-factor correlations that are not currently accounted for and can make structures difficult to refine.
4. Molecular-replacement techniques and software
The methodology is discussed by other authors in this issue.
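Most MR searches need to be told how many copies of the model to look for, so the estimate from cell volume and molecular weight discussed in Section 3.3 is worth making explicit before running any program. The sketch below uses the conventional Matthews coefficient, V_M = V_cell / (Z × n × M), with a solvent fraction of roughly 1 − 1.23/V_M. The accepted range of 1.7–3.5 Å³ Da⁻¹ is a commonly quoted rule of thumb rather than a figure from this review, and the function names and the numbers in the usage lines are hypothetical.

```python
def matthews_coefficient(cell_volume, n_symops, n_copies, mol_weight):
    """V_M in cubic Angstrom per dalton.

    cell_volume: unit-cell volume in cubic Angstrom
    n_symops:    number of symmetry operators (asymmetric units per cell)
    n_copies:    assumed number of molecules per asymmetric unit
    mol_weight:  molecular weight of one molecule in daltons
    """
    return cell_volume / (n_symops * n_copies * mol_weight)


def solvent_fraction(vm):
    """Approximate solvent fraction for a protein crystal with this V_M."""
    return 1.0 - 1.23 / vm


def plausible_copy_numbers(cell_volume, n_symops, mol_weight,
                           vm_min=1.7, vm_max=3.5, max_copies=24):
    """List (copies, V_M, solvent fraction) for copy numbers in the usual range.

    The V_M limits are a rule of thumb (an assumption, not a hard limit).
    """
    results = []
    for n_copies in range(1, max_copies + 1):
        vm = matthews_coefficient(cell_volume, n_symops, n_copies, mol_weight)
        if vm_min <= vm <= vm_max:
            results.append((n_copies, vm, solvent_fraction(vm)))
    return results


# Hypothetical crystal: 400 000 cubic Angstrom cell, 4 symmetry operators,
# 20 kDa protein.
candidates = plausible_copy_numbers(400000.0, 4, 20000.0)
```

When several copy numbers fall inside the range, each should be tried; the estimate narrows the search but does not decide it.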
5. Verifying the solution
As an aside, remember that it can be difficult to compare solutions from different programs, since the calculated amplitudes will be the same irrespective of any symmetry operator applied to the solution or alternate choice of unit-cell origin. If phases are calculated from both models, the Clipper utility Phase Comparison will indicate whether the solutions are consistent after taking into account the choice of origin.
5.1. Space-group check
The MR search programs can be run in the several alternate space groups consistent with the observed diffraction symmetry. A good indicator is if there is a significantly better result in one than the others. (Different software uses different scoring functions, but all require a strong correlation between the observed and calculated amplitudes.)
5.2. Chemical sense
We need to check whether the model makes chemical sense. Are there many clashes between symmetry copies? Is the biological entity sensible? (This can be somewhat tricky to check from the MR solution alone; many MR search programs will position the correct number of molecules but will not necessarily cluster them together. Once again MSDpisa can be used to select the best assembly from the solution.) If there are several molecules in the asymmetric unit, are they consistent with the self-rotation function? If you have some extra information such as possible positions for Se or S atoms, is this model consistent with it? (Remember to consider alternate origins and hands.)
5.3. Can the model be refined?
The usual check is that the solution model generates structure amplitudes which agree with the observed ones. Initial R values always seem to be high (typically R/free R of 55%/55% for me), but correct solutions will (usually!) refine automatically to an R/free R of about 40%/45%. The most encouraging verification is the electron density: if you can see features in the maps which are not part of the model, then the solution is probably substantially correct (Fig. 6).
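For readers who have not met the R value outside program output, the following sketch shows the usual definition, R = Σ| |Fobs| − k|Fcalc| | / Σ|Fobs|, evaluated separately for the working reflections and for the free set. It is a simplification: a single overall scale k = Σ|Fobs| / Σ|Fcalc| is assumed here, whereas refinement programs apply more elaborate scaling. The function names are hypothetical.

```python
def r_factor(f_obs, f_calc, scale=None):
    """Conventional R value between observed and calculated amplitudes.

    If no scale is supplied, a single overall scale sum(Fobs)/sum(Fcalc)
    is used (a simplifying assumption).
    """
    if len(f_obs) != len(f_calc) or not f_obs:
        raise ValueError("amplitude lists must be non-empty and of equal length")
    if scale is None:
        scale = sum(f_obs) / sum(f_calc)
    numerator = sum(abs(fo - scale * fc) for fo, fc in zip(f_obs, f_calc))
    return numerator / sum(f_obs)


def r_work_and_free(f_obs, f_calc, free_flags):
    """Return (R, free R) using one overall scale for both subsets.

    free_flags[i] is True for reflections excluded from refinement.
    """
    scale = sum(f_obs) / sum(f_calc)
    work = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, free_flags) if not free]
    free = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, free_flags) if free]
    if not work or not free:
        raise ValueError("both the working and the free set must be non-empty")
    r_work = r_factor([fo for fo, _ in work], [fc for _, fc in work], scale)
    r_free = r_factor([fo for fo, _ in free], [fc for _, fc in free], scale)
    return r_work, r_free
```

A fall in both values during refinement, with the free R following the working R downwards, is the behaviour described above for a correct solution.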
6. Refinement tricks and bias elimination
There are still intractable problems in progressing from an initial MR solution to a final model which reflects the differences between the initial search molecule and that under investigation. There is no foolproof way of recognizing where the two models will differ and the initial maps will tend to mirror the partially incorrect input structure, especially if there is a paucity of experimental data. It is still sometimes necessary to rebuild the structure slowly into a series of weighted difference maps.
If the resolution is sufficient, automated rebuilding methods combined with weighted refinement can be very successful, rebuilding and correcting most of the molecule. ARP/wARP (described by Cohen et al., 2008) and RESOLVE (Terwilliger, 2003) are well established methods for automated rebuilding.
If the data resolution extends to 1.7 Å or better, density-modification procedures such as those programmed into ACORN can eliminate bias quickly and give excellent starting maps (Figs. 7a and 7b).
6.1. Ingenuity: use all your crystallographic knowledge
There are many interesting reports of structure solution which ingeniously combine different crystallographic techniques for obtaining the final model. I list some of them here for reference.
7. Conclusions
As more and more structural information becomes available, greatly improved bioinformatics tools are being developed to analyse and display it. Although molecular replacement is becoming automated, there is still a place for crystallographic and biological insight. In some cases this can be challenging; the interaction of different symmetry elements is often extremely complex. The final frontier of automating the refinement of MR models has still not been reached.
Acknowledgements
This review rests heavily on the work of others. It borrows from tutorial material prepared by Airlie McCoy, Alexei Vagin, Andrey Lebedev and Martyn Winn. Members of the York Structural Biology Laboratory have provided data and valuable discussions. In particular, I would like to thank Olga Moroz, Carlos Martinez-Fleites, David Lawson, Carmelo Rosano and Victoria Money for providing examples. Liz Potterton helped to prepare the figures using CCP4MG (Potterton et al., 2004).
References
Baker, E. N., Blundell, T. L., Cutfield, J. F., Cutfield, S. M., Dodson, E. J., Dodson, G. G., Crowfoot Hodgkin, D. M., Hubbard, R. E., Isaacs, N. W., Reynolds, C. D., Sakabe, K., Sakabe, N. & Vijayan, N. M. (1988). Philos. Trans. R. Soc. London Ser. B, 319, 369–456.
Barton, G. J. (2008). Acta Cryst. D64, 25–32.
Cohen, S. X., Ben Jelloul, M., Long, F., Vagin, A., Knipscheer, P., Lebbink, J., Sixma, T. K., Lamzin, V. S., Murshudov, G. N. & Perrakis, A. (2008). Acta Cryst. D64, 49–60.
Cowtan, K. (2003). IUCr Comput. Commission Newsl. 2, 4–9. https://www.iucr.org/iucr-top/comm/ccom/newsletters/2003jul/index.html
Emsley, P. & Cowtan, K. (2004). Acta Cryst. D60, 2126–2132.
Fieulaine, S., Lunn, J. E., Borel, F. & Ferrer, J.-L. (2005). Plant Cell, 17, 2049–2058.
Keller, S., Pojer, F., Heide, L. & Lawson, D. M. (2006). Acta Cryst. D62, 1564–1570.
Krissinel, E. & Henrick, K. (2004). Acta Cryst. D60, 2256–2268.
Krissinel, E. & Henrick, K. (2007). J. Mol. Biol. 372, 774–797.
Martinez-Fleites, C., Ortiz-Lombardia, M., Pons, T., Tarbouriech, N., Taylor, E. J., Hernandez, L. & Davies, G. J. (2005). Biochem. J. 390, 19–27.
Moroz, O. V., Antson, A. A., Murshudov, G. N., Maitland, N. J., Dodson, G. G., Wilson, K. S., Skibshøj, I., Lukanidin, E. M. & Bronstein, I. B. (2001). Acta Cryst. D57, 20–29.
Pannu, N. S., Murshudov, G. N., Dodson, E. J. & Read, R. J. (1998). Acta Cryst. D54, 1285–1294.
Potterton, L., McNicholas, S., Krissinel, E., Gruber, J., Cowtan, K., Emsley, P., Murshudov, G. N., Cohen, S., Perrakis, A. & Noble, M. (2004). Acta Cryst. D60, 2288–2294.
Rosano, C., Zuccotti, S., Bucciantini, M., Stefani, M., Ramponi, G. & Bolognesi, M. (2002). J. Mol. Biol. 321, 785–796.
Schwarzenbacher, R., Godzik, A. & Jaroszewski, L. (2008). Acta Cryst. D64, 133–140.
Terwilliger, T. C. (2003). Acta Cryst. D59, 38–44.
Vaguine, A. A., Richelle, J. & Wodak, S. J. (1999). Acta Cryst. D55, 191–205.
Yao, J.-X., Woolfson, M. M., Wilson, K. S. & Dodson, E. J. (2005). Acta Cryst. D61, 1465–1475.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.