research papers
Automated MAD and MIR structure solution
^{a}Structural Biology Group, Mail Stop M888, Los Alamos National Laboratory, Los Alamos, NM 87545, USA, and ^{b}Biophysics Group, Mail Stop D454, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
^{*}Correspondence email: terwilliger@lanl.gov
Obtaining an electron-density map from X-ray diffraction data can be difficult and time-consuming even after the data have been collected, largely because MIR and MAD structure determinations currently require many subjective evaluations of the qualities of trial heavy-atom partial structures before a correct heavy-atom solution is obtained. A set of criteria for evaluating the quality of heavy-atom partial solutions in macromolecular crystallography has been developed. These have allowed the conversion of the crystal structure-solution process into an optimization problem and have allowed its automation. The SOLVE software has been used to solve MAD data sets with as many as 52 selenium sites in the asymmetric unit. The automated structure-solution process developed is a major step towards the fully automated structure-determination, model-building and refinement procedure which is needed for genomic scale structure determinations.
Keywords: MAD; MIR; automated structure solution.
1. Introduction
Recently, the pace of macromolecular structure determination by X-ray crystallography and NMR has seen a rapid acceleration. In 1990 just 164 new macromolecular structures were added to the Protein Data Bank (Bernstein et al., 1977), but by 1997 this had increased tenfold to 1640. At the same time that the rate of obtaining new structures has been increasing, the time required to obtain a particular new structure has decreased. Within the past year, there have been a number of cases in which a protein has been solved within one day of collecting X-ray data (e.g. V. Ramakrishnan, personal communication; R. Fahrner & D. Eisenberg, personal communication; S.H. Kim, personal communication; R. Stevens, personal communication). While the current pace of macromolecular structure determination is impressive, it will require much greater throughput if it is ever to be applied on a scale which compares with the genomic sequencing projects now under way. If macromolecular structures could be determined at even more rapid rates, it would become possible to determine structures of broad groups of proteins on a genomic scale (e.g. Pennisi, 1998; Rost, 1998; Shapiro & Lima, 1998; Terwilliger et al., 1998).
1.1. `Solving' structures using MAD or MIR X-ray data
One of the limiting stages in macromolecular structure determination by X-ray crystallography can be `solving' the structure using multiple isomorphous replacement (MIR) or multiwavelength anomalous diffraction (MAD) X-ray data (Ke, 1997; Hendrickson & Ogata, 1997). This stage of structure solution is often difficult because the partial structures of the heavy or anomalously scattering atoms which have to be solved in the MIR and MAD methods can be very complicated. Furthermore, it can be both time-consuming and challenging to identify and verify these partial structures. Structure solution by MIR or MAD currently involves many steps which require decisions to be made by the crystallographer and requires operation of several computer programs or different parts of a software package to carry out. If this process could be carried out in an automated fashion, the time required to solve a macromolecular structure once data have been collected might be greatly reduced. Despite the complexity of the MIR or MAD structure-determination process, each of the individual steps is well defined, and most possible outcomes and decisions which must be made can be anticipated in advance. Additionally, suitable computational algorithms exist for every stage in the process. This means that a complete automation of the structure-determination process is achievable, at least in principle.
The MIR and MAD structure-determination procedures are closely related and have several critical steps in common. Two of these are the identification of possible partial structures of the heavy or anomalously scattering atoms in the structure and the evaluation of the quality of each of these solutions. In both the MAD and MIR methods, possible partial structures of the heavy or anomalously scattering atoms are generally obtained either by manual or semi-automated inspection of difference Patterson functions (Terwilliger et al., 1987; Chang & Lewis, 1994; Vagin & Teplyakov, 1998) or by direct methods (Sheldrick, 1990; Miller et al., 1994). For example, a semi-automated procedure (`HASSP') which is widely used for generating possible partial structures is based on the method of Buerger (1970) and yields a ranked list of partial structures which are compatible with the difference Patterson function (Terwilliger et al., 1987). Such a list of potential solutions to the difference Patterson is only a starting point in either the MAD or MIR methods, however, as each potential partial structure must then be individually completed and evaluated.
In the MAD method, the trial anomalously scattering atom partial structure is generally refined and used to identify further anomalously scattering atoms by difference Fourier (or anomalous difference Fourier) analysis. The completed partial structure is then used to calculate phases for the entire structure, and the resulting electron density is examined visually to determine if it has the features expected of the macromolecule. This visual examination is crucial for determining whether the entire process has been successful, but there are several criteria which are commonly used at earlier stages to determine whether the structure-determination process is going well. These include the compatibility of the partial structure with the anomalous or dispersive difference Patterson functions, the figure of merit of phasing and the appearance of anomalously scattering atom sites in difference Fourier analyses calculated after omitting these sites in phasing.
The process of completing a trial heavy-atom partial structure in the MIR method differs slightly from that used in the MAD approach because the partial structures of heavy atoms must generally be determined in more than one heavy-atom derivative. Starting solutions for heavy-atom partial structures can usually be obtained for each of the available heavy-atom derivatives. These trial partial structures are ordinarily then refined and used to calculate phases for the native structure. The native phases are in turn used to calculate difference Fouriers for the other derivatives in order to identify possible heavy-atom sites in those derivatives. Additional heavy-atom sites identified in this way are included in the phasing, and the process is repeated until no further sites are found. As in the case of MAD structure determination, the structure is generally considered solved when the resulting native electron-density map is interpretable by the crystallographer. Indications that the structure determination is proceeding well are similar to those used in MAD structure determination. They include the compatibility of each heavy-atom partial structure with the corresponding difference Patterson function, the figure of merit of the phasing and cross-difference Fourier analyses involving the use of one set of derivatives in the phasing calculation and calculating a cross-difference Fourier for a different derivative.
1.2. Automated decision-making during structure determination
There are several important decisions which must be made by the crystallographer during structure solution by either the MAD or MIR methods. At early stages in the process, a key decision is to choose which trial partial structures are worth pursuing further. At later stages, key decisions must be made as to whether a particular peak found in a difference Fourier analysis should be included as part of the heavy-atom partial structure or not and which hand of the heavy atoms is correct. In the final stages of structure solution, key decisions include the decision as to which of the possible partial structures is most likely to be correct and whether the structure-solution process is completed.
An important aspect of the present work is the recognition that all these decisions could be made in a uniform way if a suitable scoring algorithm could be developed. With a scoring procedure, the decision-making process with incompletely defined criteria described above becomes instead an optimization process with a well defined target function. For example, if a list of trial heavy-atom or anomalously scattering atom partial structures could be scored in a useful way and ranked, then the highest-scoring partial structures at each stage of the analysis would be most likely to be correct and could be pursued more aggressively than lower-scoring solutions. Additional sites would be included in a partial structure and the inverse heavy-atom partial structure would be used if doing so increased the score. The structure-determination process would be completed when no partial structures with higher scores than those of the current set could be obtained. Based on this analysis, we propose that the development of a comprehensive scoring procedure for heavy-atom partial structures could make the process of structure solution well defined and amenable to automation. In this paper, we describe such a scoring system and the resulting fully automated system (`SOLVE') for macromolecular structure determination by the MIR or MAD approaches.
2. Materials and methods
2.1. Evaluation of the match between a heavy-atom partial structure and a Patterson or difference Patterson function
The first criterion we use for evaluating a trial heavy-atom solution is whether the Patterson function calculated from a heavy-atom partial structure matches the observed Patterson or difference Patterson function. This has always been an important criterion in the MIR and MAD methods (Blundell & Johnson, 1976). Our scoring in this case essentially consists of the average value of the Patterson function at predicted locations of peaks, multiplied by a weighting factor based on the number of heavy-atom sites in the trial solution. The complete raw score for a match between the Patterson function and a trial solution is calculated from the values of the Patterson function at the predicted interatomic vectors for the trial partial structure. In this calculation, the Patterson or difference Patterson function is first normalized to its r.m.s. value. Then peaks which occur in regions of the Patterson function where symmetry-related interatomic vectors coincide are divided by the number of coinciding vectors (Terwilliger et al., 1987). To exclude contributions from very high peaks which are unlikely to correspond to interatomic vectors in the model, occupancies of each heavy-atom site are refined so that the predicted peak heights match the observed peak heights at the predicted interatomic positions as closely as possible. All peak heights more than 1σ higher than their predicted values are then truncated at this height. The average value of the Patterson function at predicted interatomic vectors estimated in this way is then corrected for instances where several predicted vectors unrelated by symmetry fall on the same location by scaling it by the fraction of predicted vectors which are unique. Finally, a weighting function w(N) (see below) is applied to this average value to give the raw Patterson score.
2.2. Calculation of cross-validation difference Fourier maps
The second criterion used to evaluate heavy-atom solutions is whether each heavy-atom site appears in a `cross-validation' difference Fourier analysis calculated after omitting this site (and all equivalent sites in other derivatives) from the phase calculation. A related approach in which one derivative is omitted from phasing and the other derivatives are used to phase a difference Fourier has been used for some time (Dickerson et al., 1961). Our raw score for cross-validation difference Fouriers is the average peak height calculated in this way for each heavy-atom site, multiplied by the weighting function w(N) described below.
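As an illustration, both raw scores described so far (the Patterson score of Section 2.1 and the cross-validation difference Fourier score) reduce to a weighted average of peak heights. The sketch below is not the SOLVE implementation: the function and argument names are hypothetical, the weight w(N) is simply passed in as a number, and the Patterson values are assumed to have already been normalized to r.m.s. units and divided by the number of coinciding symmetry-related vectors.

```python
def raw_patterson_score(observed, predicted, n_unique, weight):
    """Raw Patterson score for one trial solution (illustrative sketch).

    observed  -- Patterson values, in r.m.s. units, at the predicted vectors
    predicted -- peak heights predicted from the refined occupancies
    n_unique  -- number of predicted vectors that fall on distinct locations
    weight    -- value of w(N) for the N sites in the trial solution
    """
    if not observed or len(observed) != len(predicted):
        raise ValueError("need one predicted height per observed value")
    # Truncate peaks that are more than 1 sigma (one r.m.s. unit) above
    # their predicted height.
    truncated = [min(o, p + 1.0) for o, p in zip(observed, predicted)]
    mean_height = sum(truncated) / len(truncated)
    # Scale by the fraction of predicted vectors that are unique.
    f_unique = n_unique / len(truncated)
    return weight * f_unique * mean_height


def raw_cross_validation_score(peak_heights, weight):
    """Average cross-validation difference Fourier peak height times w(N)."""
    if not peak_heights:
        raise ValueError("need at least one site")
    return weight * sum(peak_heights) / len(peak_heights)
```

Passing the weight in as a plain number keeps the two scores independent of any particular choice of weighting function.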
2.3. Weighting function for Patterson and cross-validation difference Fourier scores
Our unweighted raw scores for evaluation of Patterson and cross-validation difference Fouriers are based simply on average peak height. It seems likely that in most cases, if two solutions are being considered and they have equal average peak heights but differing numbers of heavy-atom sites, the solution with the larger number of sites is more likely to be correct. On the other hand, just how to weight this increase in number of sites is not clear. If the average peak height is simply multiplied by the number of sites, then solutions with very low average peak heights can receive high scores, for example. We have chosen an intermediate ground. The weighting function w(N) we use for the cross-validation difference Fourier is designed to favor the addition of a new site to an existing partial structure with N − 1 sites as long as the average value of the peaks at the additional sites is at least a set fraction of the average for the existing sites. We use a weighting function which has this property.
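One weighting function with this break-even property can be derived directly, and is sketched here as an assumption rather than as the exact expression used in SOLVE. Suppose N − 1 sites have average peak height h and a new site has height f·h. The new average is h(N − 1 + f)/N, so the weighted score is unchanged at f = eps exactly when w(N) = w(N − 1)·N/(N − 1 + eps). The name `eps` for the threshold fraction, the function name and the normalization w(1) = 1 are all hypothetical choices.

```python
def site_weight(n_sites, eps=0.3):
    """w(N) from the recursion w(N) = w(N-1) * N / (N - 1 + eps), w(1) = 1.

    eps is the threshold fraction: a new site whose peak height is eps times
    the current average leaves the weighted score unchanged.
    """
    if n_sites < 1:
        raise ValueError("n_sites must be at least 1")
    weight = 1.0
    for n in range(2, n_sites + 1):
        weight *= n / (n - 1 + eps)
    return weight
```

With this form, a site stronger than the threshold raises the weighted score and a weaker one lowers it, which is the behaviour described in the text.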
This weighting function is applied to both the Patterson and cross-validation difference Fourier scores. In the case of the Patterson function there are generally more predicted interatomic vectors than heavy-atom sites N, but we use N in the calculation of the weighting factor w(N) so as to make the weighting the same for Patterson and Fourier scoring. The threshold fraction is ordinarily set at a level of 0.2–0.35, so that additional sites which yield cross-validation difference Fourier peak heights 1/5 to 1/3 of the average would just be included in the heavy-atom model.
2.4. Evaluation of figure of merit of phasing
An important criterion for evaluating the quality of phasing in both the MAD and MIR methods is the overall figure of merit m (Blundell & Johnson, 1976). This parameter is sensitive to errors in heavy-atom occupancies, to the resolution of the data and to the method used to calculate phases. Nevertheless, if a single procedure is used consistently then it can be used to distinguish between solutions which have more or less potential for accurate phasing. Additionally, in the SOLVE procedure, heavy-atom occupancies are refined by origin-removed Patterson refinement, which has been demonstrated to yield relatively unbiased estimates of occupancy (Terwilliger & Eisenberg, 1983). The raw score by this criterion is simply the unweighted average figure of merit for all reflections included in phasing.
2.5. Evaluation of distinction between solvent and macromolecule in native Fourier
The final criterion used in our scoring procedure is whether the native Fourier (electron-density map) calculated based on the trial heavy-atom solution has the features expected of a crystal of a macromolecule. We have focused on one such feature which is relatively simple to evaluate, namely whether the map has distinct regions of solvent and macromolecule (Terwilliger & Berendzen, 1999). Our measure of this distinction is the variation, from one location to another in the native Fourier, of the r.m.s. electron density (not including the F_{000} term in the Fourier synthesis). In regions which contain solvent, the native Fourier is flat and the r.m.s. electron density calculated in this way is very low. In regions containing the macromolecule, the native Fourier has many peaks and valleys and the r.m.s. electron density is high. A map with a clear definition of solvent and macromolecule will have a high variation of local r.m.s. electron density from location to location in the map. The raw score for this criterion is the standard deviation of the local r.m.s. electron density calculated in boxes with dimensions approximately twice the resolution of the map in each direction.
We have shown elsewhere (Terwilliger & Berendzen, 1999) that a score of this type calculated from the native Fourier can be an excellent indicator of the quality of the map when the map is of moderate or better quality. Based on model calculations, this score is useful when the mean phase error for the map is about 80° or less. This corresponds roughly to a figure of merit of phasing of about 0.2 or greater.
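The native Fourier score can be illustrated with a small sketch. This is not the SOLVE code: it assumes the map is supplied as a nested list on an orthogonal grid with the mean (F_{000}) term already removed, uses non-overlapping boxes, drops partial boxes at the edges and ignores crystallographic periodicity. The function name and the use of the population standard deviation are assumptions.

```python
import math
import statistics


def local_rms_variation(density, box):
    """Standard deviation of the r.m.s. density in non-overlapping cubic boxes.

    density -- nested list density[z][y][x] with the mean already removed
    box     -- box edge in grid points (roughly twice the map resolution)
    """
    nz, ny, nx = len(density), len(density[0]), len(density[0][0])
    rms_values = []
    for z0 in range(0, nz - box + 1, box):
        for y0 in range(0, ny - box + 1, box):
            for x0 in range(0, nx - box + 1, box):
                total = 0.0
                for z in range(z0, z0 + box):
                    for y in range(y0, y0 + box):
                        for x in range(x0, x0 + box):
                            total += density[z][y][x] ** 2
                rms_values.append(math.sqrt(total / box ** 3))
    # A map with flat solvent and featured protein regions gives a wide
    # spread of local r.m.s. values, and hence a high score.
    return statistics.pstdev(rms_values)
```

A map that is equally featured everywhere scores zero, while a map that is flat in half of its boxes and featured in the rest scores highly.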
2.6. Calculation of final score for a heavy-atom partial structure
The overall scoring procedure is in three steps. A starting set of 10–50 trial heavy-atom partial structures are each given raw scores based on each of the four criteria described above and shown in Table 1. The mean and standard deviation of the raw scores for each criterion are calculated and are then used as a basis for normalizing all these and later raw scores to yield Z scores for each criterion, where the Z score, based on a raw score of A and a mean and standard deviation for the starting set of Ā and σ, is given by
Z = (A − Ā)/σ.
The final score for a heavy-atom solution is the sum of the Z scores for each of the four criteria. To reduce the likelihood of obtaining a high-scoring solution based on just the Patterson, figure of merit or cross-validation difference Fourier Z scores, the final score is adjusted by subtraction of half the differences between each of these and the lowest Z score among them.

When the native Fourier is of low quality, the corresponding score is not of significant utility. To reduce the contribution of the scoring from the native Fourier in cases where it is not expected to be of value, we limit the Z score for the native Fourier to a maximum value depending on the figure of merit of the map. The maximum value is set at the value obtained for cases with the corresponding figure of merit in a series of model calculations we carried out using selenomethionine MAD data and the gene 5 protein atomic model (Terwilliger & Berendzen, 1999; Skinner et al., 1994). These model cases resulted in an approximate relation between the maximum Z score and m, the average figure of merit of the phase calculation. For example, for a map with a figure of merit of 0.4, the maximum Z score allowed for this criterion would be just 0.6, while for a map with a figure of merit of 0.6 it could be as high as 2.7.
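The scoring steps of this section can be sketched as follows. This is an illustration under stated assumptions, not the SOLVE implementation: the criterion names, `z_scores`, `final_score` and `native_z_cap` are hypothetical, the population standard deviation is assumed for the normalization, and the cap on the native Fourier Z score must be supplied by the caller because the relation between the cap and the figure of merit is not reproduced here.

```python
import statistics

# The three criteria whose imbalance is penalized.
Z_CHECKED = ("patterson", "cross_validation", "figure_of_merit")


def z_scores(raw, baseline):
    """Normalize raw scores with the mean and s.d. of the starting set.

    raw      -- dict: criterion name -> raw score of one solution
    baseline -- dict: criterion name -> raw scores of the starting solutions
    """
    result = {}
    for name, value in raw.items():
        mean = statistics.mean(baseline[name])
        spread = statistics.pstdev(baseline[name])
        result[name] = (value - mean) / spread
    return result


def final_score(z, native_z_cap=None):
    """Sum of the four Z scores, less half of each excess over the lowest
    of the Patterson, cross-validation and figure-of-merit Z scores."""
    z = dict(z)
    if native_z_cap is not None:
        # Limit the native Fourier term when the map is of low quality.
        z["native_fourier"] = min(z["native_fourier"], native_z_cap)
    lowest = min(z[name] for name in Z_CHECKED)
    penalty = 0.5 * sum(z[name] - lowest for name in Z_CHECKED)
    return sum(z.values()) - penalty
```

For example, Z scores of 4, 2 and 2 for the three checked criteria plus 3 for the native Fourier sum to 11, and the penalty of 0.5 × (4 − 2) brings the final score to 10.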
2.7. Automated MIR and MAD structure determination
Fig. 1 outlines the main steps carried out by the automated `SOLVE' procedure for MIR and MAD structure determination. These consist of scaling the data, calculation of Patterson functions, finding and optimizing the heavy-atom partial structure and calculating native phases and an electron-density map. The procedures for MAD and MIR data are very similar, except that the MAD data is scaled slightly differently from MIR data and the MAD data is converted to a pseudo-SIRAS (single isomorphous replacement with anomalous scattering) form before looking for the anomalously scattering atom partial structure. This conversion allows heavy-atom refinement, which would otherwise be prohibitively slow, to be carried out very quickly by Patterson-based refinement (Terwilliger & Eisenberg, 1983; Terwilliger, 1994b). Each of these steps is described in detail below.
2.8. Scaling of X-ray data sets
The SOLVE procedure begins with integrated scaled or unscaled X-ray intensities from several X-ray wavelengths (for MAD data) or for native and several heavy-atom derivative structures (for MIR data), such as those produced by HKL (Otwinowski & Minor, 1997), MOSFLM (Leslie, 1993) or d*TREK (J. Pflugrath, personal communication). In either the MAD or MIR cases the raw intensities are converted to structure-factor amplitudes, which are brought to a common scale and partially corrected for absorption and decay effects using a local scaling procedure (Matthews & Czerwinski, 1975). The overall strategy for scaling is to minimize systematic errors by scaling F^{+} and F^{−} in as similar a fashion as possible and by keeping different data sets separate until after scaling is completed. The scaling procedure used by SOLVE is optimized for cases where data are collected in a systematic fashion so that, for example, the reflections measured for each wavelength of a MAD experiment are nearly identical.
2.9. Scaling of MIR data sets
The scaling of MIR data sets is straightforward in SOLVE. The general approach is to scale the native data, then to use it as a reference data set for scaling of the F^{+} and F^{−} data from each derivative and finally to merge all the data together.
2.10. Scaling native data
Ordinarily, the raw native data suitable for SOLVE analysis consists of one or more individual files each containing measurements of reflection intensities obtained by rotation of a crystal by 180° or less about an axis. In this way, all but a few high-resolution F(h,k,l) are present at most once in an individual file, and the data can be handled as if each point on the reciprocal lattice either has an observation associated with it or not. If the data are collected by rotations of more than 180°, the data can be broken up into smaller files for analysis.
The native data is scaled in three steps. In the first step, a reference data set is constructed from a file containing native data from a single experiment. The reference data set is constructed by local-scaling this data set to itself as follows. For each reflection (h,k,l) in the asymmetric unit, all amplitudes of structure factors equivalent by space-group symmetry are averaged to yield a merged reduced data set. This data set is then expanded to the entire reciprocal lattice using space-group symmetry and assuming F(h,k,l) = F(−h,−k,−l). This yields an averaged data set which has exact symmetry. The raw data are then local-scaled to this averaged data set. Local scaling is carried out in SOLVE one reflection (h,k,l) at a time. The average structure-factor amplitude for at least 30 reflections symmetrically arranged around (h,k,l) in reciprocal space is obtained using the same (h,k,l) for the raw and averaged data sets. The scale factor applied to F(h,k,l) for the raw data is then the ratio of these averages. The local-scaled raw data are then reduced to the asymmetric unit and duplicates are averaged to yield a scaled native data set.
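The local scale factor for a single reflection can be sketched as below. This is an illustrative reading of the description, not the SOLVE code: the neighbourhood is taken to be a cube of Miller indices that grows until it holds enough reflections present in both data sets, the central reflection is excluded, and the names `local_scale_factor`, `min_count` and `max_radius` are hypothetical.

```python
def local_scale_factor(hkl, raw, reference, min_count=30, max_radius=10):
    """Ratio of mean reference to mean raw amplitude around one reflection.

    raw, reference -- dicts mapping (h, k, l) to structure-factor amplitude
    """
    h0, k0, l0 = hkl
    for radius in range(1, max_radius + 1):
        span = range(-radius, radius + 1)
        # Symmetric cube of neighbours, without the central reflection,
        # restricted to indices measured in both data sets.
        common = [
            (h0 + dh, k0 + dk, l0 + dl)
            for dh in span for dk in span for dl in span
            if (dh, dk, dl) != (0, 0, 0)
            and (h0 + dh, k0 + dk, l0 + dl) in raw
            and (h0 + dh, k0 + dk, l0 + dl) in reference
        ]
        if len(common) >= min_count:
            mean_raw = sum(raw[i] for i in common) / len(common)
            mean_ref = sum(reference[i] for i in common) / len(common)
            return mean_ref / mean_raw
    raise ValueError("too few neighbouring reflections for local scaling")
```

Using the same set of indices for both averages means that missing reflections in either data set do not bias the ratio.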
The second step in scaling the native data is to place the reference data set on an approximate absolute scale. Setting the absolute scale of the data is helpful for several of the procedures used by SOLVE. For example, if the scale of the data is known, then occupancies of heavy-atom sites can reasonably be expected to be in the range of about 0.1–1.0. The reference data set is placed on a very approximate absolute scale using information on the number of amino-acid residues in the macromolecule (if it is a protein) along with the mean intensity of reflections in the lowest resolution shell. This simple approach is used rather than a Wilson plot (Wilson, 1942) so that the same algorithm can be applied for either low-resolution or high-resolution data.
The final step in scaling the native data is to scale all the available native data to the reference data set and then to reduce all the scaled data to the asymmetric unit and merge it into a single native data set.
2.11. Scaling of derivative data
Derivative data is scaled to the native data set after first separating the F^{+} data from the F^{−} data. The F^{+} and F^{−} data are each scaled to the native data set using local scaling. The F^{+} and F^{−} data are then reduced to the asymmetric unit, averaging measurements of equivalent reflections. Finally, two scaled data files are constructed. Each contains the scaled native data and their standard deviations. One also contains F^{+} and F^{−}, each with its standard deviation, for each derivative, and the other contains the average amplitude and the anomalous difference, each with its standard deviation, for each derivative.
2.12. Scaling of MAD data
MAD data is analyzed a little differently from MIR data by SOLVE because there is no native data set to use as a reference for all the data. The general approach used is to combine all available data into one reference data set, then to separate out Bijvoet pairs and to scale each individual F^{+} or F^{−} data set to the reference data set. The scaling is performed in two stages, with each individual F^{+} or F^{−} data set first scaled to the first data set with an overall scale factor and B factor so as to put all the data sets on the same scale. Then all data in all data sets are merged to the asymmetric unit and averaged to form the reference data set. Finally, each individual F^{+} or F^{−} data set is scaled to the reference data set with local scaling. This scaling method is used by version 1.10 of SOLVE. Earlier versions (including ones used in this paper) used a more complicated approach, in which each F^{−} set of data was first scaled to F^{+} at each wavelength and then all the wavelengths of data were scaled together. The approach described here is now used because it is simpler and yields R factors that are equal to or lower than those obtained with the more complicated approach.
2.13. Calculation of Patterson and difference Patterson functions
SOLVE uses Patterson or difference Patterson functions to generate and evaluate plausible heavy-atom solutions in MIR and MAD data sets. In the case of MIR (or SIRAS, single isomorphous replacement with anomalous scattering) data, the differences between each derivative and the native are used to calculate difference Patterson functions which serve as a starting point for obtaining possible heavy-atom partial structures. In the case of MAD data, the multiwavelength data are combined to yield Bayesian estimates of the amplitude F_{A} and relative phase α of the structure factor corresponding to the anomalously scattering atoms. These structure-factor amplitudes are in turn used to calculate a Patterson function corresponding to the partial structure of the anomalously scattering atoms (MADBST; Terwilliger, 1994a). Additionally, the multiwavelength data are used to generate a pseudo-SIRAS data set which is then treated just like an SIRAS data set until the final stage of phase calculation (MADMRG; Terwilliger, 1994b).
2.14. Solving the heavy-atom structure
The core of the SOLVE algorithm is the identification and optimization of the heavy-atom (or anomalously scattering atom) locations, occupancies and thermal parameters. MAD and MIR data sets are treated identically for this part of the structure determination. This is possible because MAD data has been converted to pseudo-SIR data with anomalous differences in the previous step. In either the MAD or MIR cases, the available data consist of a Patterson function for each `derivative' (where there is a single `derivative' for MAD data) and scaled data for a `native' and one or more `derivatives'.
Fig. 2 illustrates the approach used by SOLVE for determining the heavy-atom structure. The procedure begins by generating a few likely partial solutions to the heavy-atom structure which are then used as `seeds' to generate more complete solutions. The generation of seeds is carried out by construction of a list of trial partial solutions for the heavy-atom structure using HASSP (Terwilliger et al., 1987), followed by refinement and scoring of each trial solution. The top seeds (typically five) are then used in the generation of new trial solutions by addition and refinement of sites identified by difference Fourier analysis, by subtraction of sites and by inversion. The last step is carried out iteratively until no further improvement is obtained. The scoring procedure described above is used to identify those trial solutions which are likely to be correct, and at each stage a group of solutions with high scores is maintained.
2.15. Obtaining potential seeds using HASSP
Trial partial solutions (`seeds') for the heavy-atom structure can be input directly to SOLVE, but are generally obtained by analysis of the Patterson function using the automated procedure HASSP (Terwilliger et al., 1987). This procedure uses the method of Buerger (1970) for deconvolution of a Patterson function, and it scores solutions based on the likelihood of obtaining the solution by chance. SOLVE then calculates a preliminary score for each of these solutions based on the Patterson function alone as described above. SOLVE analyzes the Patterson or difference Patterson functions for each of the derivatives which are being considered, and chooses the top solutions from each derivative as potential seeds.
2.16. Refinement and scoring of potential seeds
Potential seeds are refined using origin-removed Patterson refinement as implemented in the program HEAVY (Terwilliger & Eisenberg, 1983). This procedure for heavy-atom refinement has three features which are critical to SOLVE. One is that the occupancies, thermal factors and positions can be refined with origin-removed Patterson refinement using a single derivative. This means that the MAD data which is converted to a pseudo-SIRAS form can be refined effectively. The second feature is that this procedure yields relatively unbiased estimates of occupancies. This is important as it means that occupancies are not systematically overestimated when the data is poor, so that the overall figure of merit is a relatively good indication of the phasing quality. The third important feature is that Patterson-based refinement is fast, as derivatives are independent of each other and phases only need to be calculated every few cycles. This speed is crucial to the operation of SOLVE because, even so, as much as 75% of the time running SOLVE is spent on heavy-atom refinement and phasing.
Potential seeds are rejected in the heavy-atom refinement step if the refinement does not yield plausible parameters. For example, any seed for which occupancies of all sites refine to zero, for which coordinates shift by large distances, for which the figure of merit is low (less than 0.01) or for which heavy-atom refinement fails for any reason is rejected.
Once the heavy-atom parameters in a potential seed have been refined, the solution is scored using the four criteria in Table 1. The top group of solutions is then used as seeds in the next step, described below.
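The rejection rules for refined seeds can be written as a simple predicate. The sketch below is only an illustration: the function and argument names are hypothetical, and the shift limit `max_shift` is an assumed placeholder, since the text says only that seeds whose coordinates shift by large distances are rejected. The 0.01 figure-of-merit limit is the one quoted above.

```python
def seed_is_plausible(occupancies, shifts, figure_of_merit,
                      refinement_ok=True, max_shift=5.0, min_fom=0.01):
    """Return True if a refined seed passes the basic plausibility checks.

    occupancies     -- refined occupancy of each site
    shifts          -- coordinate shift of each site during refinement
    figure_of_merit -- overall figure of merit after refinement
    refinement_ok   -- False if the refinement failed for any reason
    max_shift       -- assumed limit on an acceptable shift (placeholder)
    """
    if not refinement_ok:
        return False
    if all(occ <= 0.0 for occ in occupancies):
        return False  # every site refined to zero occupancy
    if any(shift > max_shift for shift in shifts):
        return False  # a site moved too far from its starting position
    return figure_of_merit >= min_fom
```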
2.17. Generating new trial solutions
SOLVE generates new trial solutions in three ways: by addition of sites identified from difference Fourier analysis, by deletion of sites and by inversion. For example, a seed obtained as above is used to calculate native phases, and from these phases difference Fourier maps are calculated for each derivative. In the case of MAD data, the difference Fourier maps are calculated using the native phases along with the F_{A} and α values for the anomalously scattering partial structure estimated from MADBST (Terwilliger, 1994a). The top peaks in the difference Fourier maps are added to the seed one at a time in order to generate new trial solutions. Peaks which are close (typically within about twice the resolution of the data) to an existing heavy-atom site or its symmetry equivalent are not considered. New solutions which are equivalent to any solution which has been examined previously from this seed are ignored. Each trial solution is then refined and scored.
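The proximity test applied to difference Fourier peaks can be sketched as follows. This is a simplified illustration, not the SOLVE code: coordinates are assumed to be Cartesian and in the same length units as the resolution, symmetry equivalents of the existing sites are assumed to have been added to `sites` by the caller, and the names `novel_peaks` and `factor` are hypothetical.

```python
import math


def novel_peaks(peaks, sites, resolution, factor=2.0):
    """Keep peaks farther than factor * resolution from every existing site.

    peaks, sites -- sequences of (x, y, z) Cartesian coordinates
    resolution   -- resolution of the data, in the same length units
    """
    cutoff = factor * resolution
    return [
        peak for peak in peaks
        if all(math.dist(peak, site) > cutoff for site in sites)
    ]
```

With data to 2.5 Å, for instance, a peak 1 Å from an existing site would be discarded and a peak 9 Å away would be kept.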
Once a solution with a number of heavy-atom sites has been constructed, the solution as a whole may contain enough information to show that one or more of the sites included at an early stage are not correct. SOLVE identifies these in several ways. One is that the incorrect sites may refine to zero occupancy during heavy-atom refinement and be deleted. Another way is to systematically delete each site in a solution and test whether the solution lacking the site has a higher score than the original. SOLVE ordinarily carries out this deletion procedure on all trial solutions.
Finally, SOLVE attempts to generate additional trial solutions by inversion of all the heavy-atom sites in the seed. The reason this is useful is that three of the four scoring criteria will yield identical results for a solution and its inverse even if anomalous differences have been measured (as long as the space group is not chiral). The Patterson analysis, the cross-validation difference Fourier analysis and the figure of merit are all independent of the hand of the solution for achiral space groups. Of our four criteria, only the native Fourier analysis can distinguish the hand of the heavy atoms in this case, and then only if anomalous differences are included in the analysis. This means that in early stages of generating the heavy-atom solution, where the native Fourier is very noisy and contributes little to the scoring, it is difficult to identify the correct hand of the heavy atoms. Consequently a solution may be built up that is largely correct but has the wrong hand. Therefore, SOLVE tests the inverse of each heavy-atom solution in an attempt to generate a solution with the correct hand when anomalous differences are used and the space group is achiral.
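The three ways of generating trial solutions can be sketched as a small generator. This is a toy illustration rather than the SOLVE algorithm: sites are fractional coordinates, inversion is taken through the origin (which is only appropriate for some space groups), and two solutions are treated as equivalent only when they contain exactly the same sites, so symmetry and origin shifts are ignored. All names are hypothetical.

```python
def invert(site):
    """Inverse of a fractional site: (x, y, z) -> (-x, -y, -z), modulo 1."""
    return tuple((-c) % 1.0 for c in site)


def trial_solutions(seed, new_peaks, seen):
    """Yield trial solutions not examined before for this seed.

    seed      -- tuple of sites, each an (x, y, z) tuple
    new_peaks -- candidate sites from difference Fourier maps
    seen      -- set of frozensets of sites already examined (updated here)
    """
    # Addition: one new peak at a time.
    candidates = [seed + (peak,) for peak in new_peaks]
    # Deletion: drop each site in turn, keeping at least one site.
    if len(seed) > 1:
        candidates += [seed[:i] + seed[i + 1:] for i in range(len(seed))]
    # Inversion: the opposite hand of the whole solution.
    candidates.append(tuple(invert(site) for site in seed))
    for candidate in candidates:
        key = frozenset(candidate)
        if key not in seen:
            seen.add(key)
            yield candidate
```

Each yielded solution would then be refined and scored, and the best ones kept as seeds for the next round.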
2.18. Restricting the heavy-atom search once a promising partial solution is found
If SOLVE does not find any solutions which are very likely to be correct, it begins with each seed in turn and attempts to complete it as described above. On the other hand, if a very promising partial solution is found, SOLVE will just attempt to complete it as quickly as possible and finish. SOLVE uses a simple set of criteria to identify promising solutions. They must have an overall figure of merit of 0.5 or greater and an overall Z score of 10 or greater (that is, they must be about 10 standard deviations above the average score of starting solutions obtained from HASSP). When SOLVE finds such a solution, it no longer generates trial solutions by single-site deletions and it only keeps the one top solution present at any time (instead of a group of top solutions). Once this solution cannot be further improved by addition of new sites found in difference Fourier analyses, SOLVE once again tests solutions generated both by deletion and addition. When no further improvement is obtained in this way, the highest-scoring solution is reported.
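The test for a promising solution is a pair of thresholds, sketched here with the values quoted above. The function and parameter names are hypothetical.

```python
def is_promising(figure_of_merit, z_score, min_fom=0.5, min_z=10.0):
    """True if a solution meets both thresholds for a restricted search."""
    return figure_of_merit >= min_fom and z_score >= min_z
```

A search loop could call this after each round of scoring and, once it returns True, stop single-site deletions and keep only the top solution.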
2.19. Calculating native phases
Native phases are needed for calculation of electron-density maps as well as for three of the four criteria used in scoring (cross-validation difference Fouriers, figure of merit and analysis of the native Fourier). In all cases, SOLVE uses the `best' rather than `most probable' phases for analysis (Blundell & Johnson, 1976). For MIR data, Bayesian correlated phasing (Terwilliger & Berendzen, 1996) is used at all stages of SOLVE operation. This phasing approach automatically takes into consideration any correlated non-isomorphism or errors in the derivative data. For MAD or SIRAS data, phasing during the heavy-atom solution phase of SOLVE operation is carried out using a standard approach as implemented in the program HEAVY (Terwilliger & Eisenberg, 1983). For MAD data, this phasing method is much more rapid than a more complete treatment of the phasing would be (e.g. Terwilliger & Berendzen, 1997; de la Fortelle & Bricogne, 1997) and is useful in speeding up the operation of SOLVE. Once a final solution has been obtained by SOLVE, however, phases are calculated for MAD data using Bayesian correlated MAD phasing (Terwilliger & Berendzen, 1997), an approach which uses all the original MAD data and includes correlations of errors among the data collected at different wavelengths.
2.20. Output of SOLVE
The final output of the SOLVE algorithm consists of three items: an electron-density map (in newezd format compatible with O; Jones et al., 1991), which can be imported into the CCP4 suite (Collaborative Computational Project, Number 4, 1994) using the routine mapman; a file containing native structure-factor amplitudes, phases and Hendrickson–Lattman coefficients (Hendrickson & Lattman, 1979), which can be imported into the CCP4 suite using f2mtz; and a command file which can be modified and used to run SOLVE and calculate phases or generate additional heavy-atom sites.
2.21. Generation of model X-ray data sets
SOLVE can model raw X-ray data for either MIR or MAD in which the macromolecular structure is defined by a file in PDB format (Bernstein et al., 1977) and heavy-atom parameters are specified by the user. The `generate' feature allows any degree of `experimental' uncertainty in the measurement of intensities. It also allows limited non-isomorphism for MIR data in which cell dimensions differ for the native and any of the derivative data sets (but in which the macromolecular structure is identical).
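As a rough illustration of adding `experimental' uncertainty to calculated intensities, the sketch below perturbs each intensity with Gaussian noise whose standard deviation is a fixed fraction of the intensity. The Gaussian error model, the function name `add_measurement_error` and its parameters are assumptions; the text does not specify how the `generate' feature models errors.

```python
import random
from typing import List, Optional, Sequence, Tuple


def add_measurement_error(intensities: Sequence[float],
                          fractional_sigma: float = 0.04,
                          seed: Optional[int] = None) -> List[Tuple[float, float]]:
    """Return (noisy_intensity, sigma) pairs for a list of model intensities."""
    rng = random.Random(seed)  # private generator so results are reproducible
    noisy = []
    for intensity in intensities:
        # Assumed error model: sigma proportional to the intensity itself.
        sigma = fractional_sigma * abs(intensity)
        noisy.append((rng.gauss(intensity, sigma), sigma))
    return noisy


if __name__ == "__main__":
    # A 4% error lies within the 3-5% range used for the model MIR data.
    for value, sigma in add_measurement_error([100.0, 250.0, 40.0], 0.04, seed=1):
        print(f"I = {value:8.2f}  sigma = {sigma:5.2f}")
```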
Once a data set has been generated, the SOLVE algorithm can then be applied to the data set in an attempt to solve it. SOLVE can calculate an electron-density map based on the structure input in PDB format and evaluate the correlation of this map with the maps that it generates during the structure-determination process. For heavy-atom solutions with the inverse hand, this comparison is of course not possible. For heavy-atom solutions which are related to a different origin than the correct solution, the origin shift is automatically determined by SOLVE by finding the origin shift which leads to the closest correspondence of heavy-atom sites in the trial and correct solutions. We use this correlation as an objective measure of the quality of a heavy-atom solution and as a basis for evaluating the utility of our four scoring criteria.
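The map comparison described here amounts to a linear correlation coefficient between density values sampled on a common grid. A minimal sketch follows, assuming both maps have already been placed on the same origin and flattened into equal-length lists; the function name `map_correlation` is hypothetical.

```python
from math import sqrt
from typing import Sequence


def map_correlation(map_a: Sequence[float], map_b: Sequence[float]) -> float:
    """Pearson correlation coefficient between two maps on the same grid."""
    if len(map_a) != len(map_b) or not map_a:
        raise ValueError("maps must be non-empty and sampled on the same grid")
    n = len(map_a)
    mean_a = sum(map_a) / n
    mean_b = sum(map_b) / n
    # Sums of products of deviations from the mean density.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(map_a, map_b))
    var_a = sum((a - mean_a) ** 2 for a in map_a)
    var_b = sum((b - mean_b) ** 2 for b in map_b)
    if var_a == 0.0 or var_b == 0.0:
        raise ValueError("a completely flat map has no defined correlation")
    return cov / sqrt(var_a * var_b)


if __name__ == "__main__":
    model = [0.1, 0.9, 0.3, 0.7, 0.2]
    trial = [0.2, 0.8, 0.4, 0.5, 0.1]
    print(round(map_correlation(model, trial), 3))
```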
Model data sets were constructed using the `generate' feature of SOLVE, using two different model proteins. One model protein consisted of coordinates from a dehalogenase enzyme from Rhodococcus species ATCC 55388 (American Type Culture Collection, 1992), determined recently in our laboratory, containing 316 amino-acid residues and crystallizing in space group P2_{1}2_{1}2 with cell dimensions a = 94, b = 80, c = 43 Å (J. Newman, personal communication). The other was based on the gene 5 protein structure in space group C2 with cell parameters a = 76, b = 28, c = 42 Å, β = 103° (PDB entry 1bgh; Skinner et al., 1994). For the MIR data, `experimental' uncertainties of 3–5% (on intensity) and variation in cell dimensions of 1% from crystal to crystal were used. For the MAD data, uncertainties of 2–4% were used. The dehalogenase model was used to generate 132 MIR data sets consisting of a native crystal and two derivative crystals. Each MIR data set contained 6–10 Hg or Au heavy-atom sites with `occupancies' of 0.4–2.6 and thermal factors of 30–50 Å^{2} (although the higher values of `occupancy' are not realistic for this structure, they are included to simulate the effects of a full-occupancy Hg or Au in a smaller structure). The gene 5 protein model was used to generate 287 MAD data sets with 4–8 selenomethionine sites with `occupancies' of 0.6–1.4 and thermal factors of 30–50 Å^{2}. All the data sets were generated including anomalous differences. During the course of each structure determination, trial solutions were scored using the four criteria in Table 1. The Z scores for each trial solution and the correlation coefficients of trial and correct electron-density maps were recorded for all trial solutions which had the correct hand. Those that had the opposite hand were not considered, as our simple correlation-coefficient measure of the actual quality of solutions was not applicable.
3. Results
3.1. A scoring system for evaluating heavy-atom partial structures in the MAD and MIR methods
The approach we have taken for evaluating MIR heavy-atom (or anomalously scattering atom in the MAD method) partial structures is to quantify criteria that have been applied in a qualitative fashion for some time in the MIR and MAD approaches. The first criterion (Table 1) is the match between the Patterson function and the interatomic vectors predicted from the trial heavy-atom structure (Blundell & Johnson, 1976). The second consists of the peak heights at heavy-atom positions in `cross-validation' difference Fourier maps. These are calculated by using all but one heavy atom in phasing. The peak height at the position of the deleted atom is a measure of the self-consistency of the heavy-atom solution (Dickerson et al., 1961). The third criterion we use is simply the figure of merit of phasing (Blundell & Johnson, 1976). This is a measure of the precision of the phases obtained. The final criterion is the existence of well defined regions containing solvent and macromolecule in the native electron-density map (Terwilliger & Berendzen, 1999). These criteria are described in detail in §2.
3.2. Evaluating scoring criteria using SOLVE to generate and analyze model data
To evaluate the scoring criteria illustrated in Table 1 and to test the overall SOLVE algorithm, model data were constructed using the `generate' feature of SOLVE based on crystal structures of a dehalogenase enzyme (J. Newman, personal communication) and gene 5 protein (Skinner et al., 1994). The SOLVE structure-solution algorithm was then applied to these model data sets and the utility of the scoring criteria was evaluated by comparing them with the correlation coefficient between maps calculated by SOLVE during structure determination and model maps.
3.3. Evaluating SOLVE scoring criteria
Each of our four scoring criteria was evaluated for a series of 419 model structure determinations using the correlation coefficient between correct and trial electron-density maps as a measure of the actual quality of each solution. The purpose of this comparison is to evaluate whether the four scoring criteria are useful in differentiating between solutions which lead to a map of high quality and those which do not.
Fig. 3 shows the Z scores for each scoring criterion for one of the 419 test cases (based on the dehalogenase and gene 5 protein structures) as a function of the quality of the solutions (the correlation coefficient of the corresponding electron-density map to the model map). As expected, the Z scores for each criterion generally increase with increasing correlation coefficients between model and trial maps. The relationship between correlation coefficient and Z score differs considerably from one criterion to another, however. The Z scores for agreement with the Patterson function increase gradually over the range of correlation coefficients. In contrast, the Z scores for cross-validation Fourier analyses are nearly constant over the range of correlation coefficients from 0 to 0.25, but then increase at a much greater rate than the Patterson scores.
Fig. 3 indicates that any of the four criteria we have selected would have some use in evaluating the relative quality of different trial solutions, but that the different criteria have slightly different behavior at different stages of structure determination. In particular, the Patterson analysis and cross-validation Fourier analyses appear to be of the most use for solutions with correlation coefficients in the range 0.3–0.4, while the analysis of the native Fourier appears to be the strongest criterion for identification of correct solutions with correlation coefficients above this range.
One way to illustrate the predictive power of each criterion is to evaluate its ability to determine which of two possible solutions that differ in quality by a certain amount (e.g. 0.05 units of correlation coefficient between model and trial maps) is of a higher quality. This ability is central to the SOLVE algorithm, which maintains a ranked list of top solutions at any one time. The probability of a correct choice can be estimated from Fig. 3 by determining the percentage of cases where the solution with the higher correlation coefficient has a higher score. Pairwise comparisons of solutions which differed by 0.05 units in correlation coefficient were used in this analysis. Fig. 4 shows a plot based on all 419 test structure determinations which illustrates this probability where all the pairwise comparisons are within the same structure determination. For solutions of poor quality (with correlations between model and trial maps of less than about 0.1), all of the criteria had only about a 50% chance of identifying the better solution in a pairwise comparison. In contrast, for solutions with better quality (with correlations between model and trial maps of about 0.3–0.5), each scoring criterion had considerable utility in identifying the better solution. Comparison of a solution with the Patterson function allowed a correct identification in about 60% of the cases. The figure of merit could be used to make this distinction in about 75% of the cases. The cross-validation difference Fourier was correct in about 80–85% of cases, and analysis of the native Fourier map could be used in 75–95% of cases to identify the better solution. The overall Z score was nearly as good as the best of the four individual criteria over the entire range of map quality. Therefore, it appears to be a reasonable overall measure of the quality of a solution.
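The pairwise comparison can be sketched as follows. Each trial solution is represented by an assumed `(correlation coefficient, Z score)` pair, and the function counts how often the solution with the higher correlation coefficient also has the higher score. The tolerance used to decide that two solutions differ by `about' 0.05 units is an assumption, as is the function name; the text does not state how pairs were binned.

```python
from typing import List, Optional, Tuple

# (map correlation coefficient, Z score for one criterion); assumed layout.
Solution = Tuple[float, float]


def pairwise_ranking_accuracy(solutions: List[Solution],
                              cc_gap: float = 0.05,
                              tolerance: float = 0.005) -> Optional[float]:
    """Fraction of pairs separated by about `cc_gap` in correlation
    coefficient for which the better solution also has the higher score."""
    correct = 0
    total = 0
    for i, (cc_i, z_i) in enumerate(solutions):
        for cc_j, z_j in solutions[i + 1:]:
            cc_diff = cc_j - cc_i
            # Keep only pairs whose quality differs by roughly cc_gap.
            if abs(abs(cc_diff) - cc_gap) > tolerance:
                continue
            total += 1
            # Same sign means the score ranks the pair in the right order.
            if (z_j - z_i) * cc_diff > 0:
                correct += 1
    return correct / total if total else None


if __name__ == "__main__":
    trials = [(0.30, 1.0), (0.35, 2.0), (0.40, 1.5)]
    print(pairwise_ranking_accuracy(trials))
```

A value near 0.5 corresponds to a criterion with no discriminating power, which is the behavior reported above for solutions of poor quality.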
After the SOLVE algorithm is applied to a data set, it is useful to have an idea of whether the top solution that it has found is likely to actually represent a correct solution. Fig. 5 shows the overall score and correlation coefficient to the model map of the top solutions found in each of the 419 model structure determinations we carried out. In 180 of the 419 structure determinations shown in Fig. 5, SOLVE was able to obtain an electron-density map with a correlation coefficient to the model map of 0.2 or greater. Fig. 5 indicates that in this set of test structure determinations with 4–10 heavy-atom sites, those solutions with overall Z scores of greater than 20 were nearly always correct. Those with scores in the range of about 10–20 were sometimes correct and sometimes not, and those with scores less than 10 were rarely correct. It should be noted that although these results with model data give a general idea of the range of scores which are associated with maps of various qualities, the relationship between map quality and overall scores is likely to be dependent on the details of the structure determination. Consequently, Fig. 5 should be used only as a rough guide to the likely quality of a solution.
3.4. Application of SOLVE to experimental MAD and MIR data
SOLVE has now been used to determine many MIR and MAD structures, with two of the largest structures consisting of MAD structures with 26 and 52 selenomethionine residues in the asymmetric unit, respectively (S. Ealick, personal communication; W. Smith & C. Janson, personal communication). A test MAD structure determination (with 15 selenomethionine sites in the asymmetric unit) and an actual MIR structure determination (with five derivatives, each containing 2–4 heavy-atom sites) are illustrated here to evaluate the application of SOLVE to experimental data.
3.5. MAD structure determination
A four-wavelength MAD data set collected on β-catenin (Huber et al., 1997) was used to test SOLVE on MAD data. This structure was originally solved using RSPS (Knight, 1989), but it was a good test case because of the large number of selenomethionine residues (15) in the protein and the availability of a refined structure for comparison. The space group was C222_{1} with unit-cell dimensions of a = 64, b = 102, c = 187 Å. Scaled MAD data (17000 observations to a resolution of 2.7 Å) were converted to intensity data. This reflection information was input to SOLVE along with the approximate number of amino-acid residues in the protein (700), the number of expected selenium sites (15) and estimates of the scattering factors for selenium (SOLVE can refine the values of the scattering factors if they are not known accurately). Default values were used for all other parameters. SOLVE identified a single solution with 12 selenium locations. All 12 selenium locations as well as the hand of the solution were correct. The additional selenium sites used in the original structure determination included one with a thermal factor of 85 Å^{2} and two with partial occupancies in the refined structure (Huber et al., 1997). The overall figure of merit was 0.67 and the overall Z score of the solution was 54. SOLVE required approximately 4 h on a 500 MHz DEC Alpha workstation to find this solution, and three additional hours to verify that no similar solutions would yield higher overall scores.
The hand of the selenium partial structure was identified by SOLVE using the analysis of the native Fourier map. The Z score for analysis of the native Fourier for the correct hand was 4.7 (i.e. the final solution had a score 4.7 standard deviations above the starting set of trial solutions), while that of the inverse hand was only 0.5. The utility of this analysis of the native Fourier map is illustrated in Fig. 6, which shows sections through the native Fourier calculated by SOLVE using 11 selenium sites with either the correct or inverted hands. The map with the correct hand has features expected of a protein: regions which are flat (solvent) and other regions which have high variation (the protein). In contrast, the map calculated with an inverted set of selenium sites has a very uniform level of variation throughout and does not have the appearance expected of a protein crystal.
Fig. 7 shows a section of electron density from the map calculated by SOLVE and coordinates from the refined model of β-catenin (with an origin shift so that the selenium sites used in the original structure determination match the ones obtained by SOLVE). The electron-density map is of high quality and is readily interpretable.
3.6. MIR structure determination with SOLVE
SOLVE was recently used in the structure determination of a dehalogenase enzyme from Rhodococcus strain ATCC 55388 (J. Newman, personal communication). This protein crystallized in space group P2_{1}2_{1}2 with cell parameters of a = 94, b = 80, c = 43 Å, and MIR data were collected to a resolution of 2.5 Å on the native and five derivatives. Raw unmerged data produced by HKL (Otwinowski & Minor, 1997) were input to SOLVE, along with the identities of the heavy atoms in each derivative, a limit of five heavy-atom sites per derivative and the estimated number of amino-acid residues in the asymmetric unit (250). SOLVE identified between two and four heavy-atom sites in each derivative and calculated the electron-density map illustrated in Fig. 8, which has an overall figure of merit of 0.69. The map is of excellent quality and is readily interpretable. For the actual structure solution, this map was further improved by solvent flattening (Abrahams et al., 1994). The solution required approximately 4 h to obtain and 1 h to check using a 500 MHz DEC Alpha workstation.
4. Conclusions
We have found the SOLVE algorithm to be exceptionally useful in determining macromolecular structures based on MIR and MAD X-ray data, both because of its simplicity of use and because of the thoroughness of its search for heavy-atom solutions. The simplicity of using SOLVE is largely made possible by the development of quantitative measures of the quality of heavy-atom solutions, allowing the determination of the heavy-atom structure to be transformed from a decision-making problem with incompletely defined criteria to a straightforward optimization problem. Simplicity of use is also made possible by choosing default parameters which are applicable to a wide variety of situations, so that in most cases it is not necessary for the user to adjust them. The incorporation of robust yet standardized methods for scaling is also important for the ease of use of SOLVE, as it is therefore able to begin with raw data files containing integrated intensities and scale MIR or MAD data without manual intervention.
The thoroughness of the search for heavy-atom solutions is an important feature of SOLVE. In the MIR method, a search for a `good' (usually single-site) derivative with which to find the heavy-atom sites in all the other derivatives is often a time-consuming and difficult stage in structure determination. In this process, tools such as RSPS (Knight, 1989) or HASSP (Terwilliger et al., 1987) are often used to generate plausible solutions to a difference Patterson function. These solutions must then be individually checked for their agreement with the Patterson and their ability to contribute to phasing the native data and to identify heavy-atom sites in other derivatives. As the process is often slow and involved, usually only a small number of solutions can be tested. Because SOLVE is automated, it is now practical to test many more starting solutions and to follow each one through, building up complete trial MIR solutions which can be evaluated relative to each other using the objective SOLVE scoring system. Using this scoring system, the correctness of each individual heavy atom in the solution can also be checked by deleting it and re-evaluating the score of the solution.
One of the most important features of SOLVE is its ability to evaluate the quality of an electron-density map during the structure-determination process and to use this as part of the evaluation of each trial heavy-atom solution. When MIR or MAD heavy-atom structures are determined using either the Patterson function (Terwilliger et al., 1987; Chang & Lewis, 1994) or direct methods (Sheldrick, 1990; Miller et al., 1994), structure-factor amplitudes corresponding to the heavy-atom partial structure are extracted from the raw data. Because of this separation of heavy-atom structure factors from total structure factors, information contained in the original structure factors which could be used to solve the heavy-atom partial structure is ignored. In particular, only after the heavy-atom partial structure is `solved' is a native Fourier calculated and visually examined. In contrast, SOLVE is able to evaluate potential heavy-atom solutions both with respect to their agreement with the Patterson function and with respect to the qualities of the resulting native Fourier, cross-validation difference Fourier and figure of merit. The examination of the native Fourier not only yields information on the overall quality of a solution but also can often positively identify the hand of the heavy-atom solution when anomalous differences have been measured. The incorporation of these different sources of information about the quality of heavy-atom solutions allows SOLVE to use more of the information present in a MAD or MIR experiment than has previously been possible during the process of structure determination.
SOLVE is fundamentally different from other software used for MIR and MAD structure determinations because of its incorporation of quantitative measures of the quality of a solution and because of its complete automation. Other packages such as PHASES (Furey & Swaminathan, 1997), HEAVY (Terwilliger & Eisenberg, 1987; Terwilliger & Berendzen, 1996) or SHARP (de la Fortelle & Bricogne, 1997) can carry out all the steps necessary to determine a structure by MIR or MAD, but they do not provide the range of objective and quantifiable measures of the quality of a potential solution that SOLVE does. A user must, for example, evaluate a native Fourier map visually to assess whether a solution is likely to be correct. Because of its ability to provide quantitative measures of the quality of a solution, SOLVE is able both to provide the user with useful criteria for comparing solutions when the user wishes to be closely involved in decision making in the structure-determination process and to carry out the entire process without any input at all.
We anticipate that SOLVE will be of significant use not just in MAD and MIR structure determinations carried out one-by-one as they are today, but also in more high-throughput applications which are now being widely discussed. Because of the automation and ease of use of SOLVE, it has already been used in several instances to solve a structure within a few hours of the data being collected (R. Fahrner & D. Eisenberg, personal communication; R. Stevens, personal communication). It seems reasonable to imagine a largely automated process of structure determination at synchrotron sources beginning with MAD data collection (e.g. on selenomethionine-containing crystals) and continuing through data processing and structure solution at least as far as calculation of an electron-density map. With further development of automated model building and refinement (Zou & Jones, 1996), the entire process of structure determination, model building and refinement might be automated for straightforward cases. For more complicated cases which cannot be solved automatically, the quantitative evaluation of heavy-atom solutions carried out by SOLVE is likely to be an important tool for the macromolecular crystallographer in structure determination.
Complete documentation of the SOLVE software and information on obtaining the program are available on the internet at http://www.solve.lanl.gov.
Acknowledgements
We would like to thank J. Newman for many helpful suggestions for improvements in SOLVE and for use of the dehalogenase data and A. Huber and W. Weis for the use of the β-catenin data. We are especially grateful to the many users of SOLVE who have provided feedback and helpful comments. We would like to thank the DOE, the NIH and the Laboratory-Directed Research and Development Program of Los Alamos National Laboratory for generous support.
References
Abrahams, J. P., Leslie, A. G. W., Lutter, R. & Walker, J. E. (1994). Nature (London), 370, 621–628.
American Type Culture Collection (1992). Catalogue of Bacteria and Bacteriophages, 18th ed., pp. 271–272.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535–542.
Blundell, T. L. & Johnson, L. N. (1976). Protein Crystallography, p. 368. New York: Academic Press.
Buerger, M. J. (1970). Contemporary Crystallography. New York: McGraw-Hill.
Chang, G. & Lewis, M. (1994). Acta Cryst. D50, 667–674.
Collaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760–763.
Fortelle, E. de la & Bricogne, G. (1997). Methods Enzymol. 277, 472–494.
Dickerson, R. E., Kendrew, J. C. & Strandberg, B. E. (1961). Acta Cryst. 14, 1188–1195.
Furey, W. & Swaminathan, S. (1997). Methods Enzymol. 277, 590–620.
Hendrickson, W. A. & Lattman, E. E. (1979). Acta Cryst. B26, 136–143.
Hendrickson, W. A. & Ogata, C. M. (1997). Methods Enzymol. 276, 494–523.
Huber, A. H., Nelson, W. J. & Weis, W. I. (1997). Cell, 90, 871–882.
Jones, T. A., Zou, J. Y., Cowan, S. W. & Kjeldgaard, M. (1991). Acta Cryst. A47, 110–119.
Ke, H. (1997). Methods Enzymol. 276, 448–461.
Knight, S. (1989). PhD thesis. Swedish University of Agricultural Sciences, Uppsala, Sweden.
Leslie, A. G. W. (1993). Proceedings of the CCP4 Study Weekend, edited by L. Sawyer, N. Isaacs & S. Bailey, pp. 44–51. Warrington: Daresbury Laboratory.
Matthews, B. W. & Czerwinski, E. W. (1975). Acta Cryst. A31, 480–487.
Miller, R., Gallo, S. M., Khalak, H. G. & Weeks, C. M. (1994). J. Appl. Cryst. 27, 613–621.
Otwinowski, Z. & Minor, W. (1997). Methods Enzymol. 276, 307–326.
Pennisi, E. (1998). Science, 279, 978–979.
Rost, B. (1998). Structure, 6, 259–263.
Shapiro, L. & Lima, C. D. (1998). Structure, 6, 265–267.
Sheldrick, G. M. (1990). Acta Cryst. A46, 467–473.
Skinner, M. M., Zhang, H., Leschnitzer, D. H., Bellamy, H., Sweet, R. M., Gray, C. M., Konings, R. N. H., Wang, A. H.-J. & Terwilliger, T. C. (1994). Proc. Natl Acad. Sci. USA, 91, 2071–2075.
Terwilliger, T. C. (1994a). Acta Cryst. D50, 11–16.
Terwilliger, T. C. (1994b). Acta Cryst. D50, 17–23.
Terwilliger, T. C. & Berendzen, J. (1996). Acta Cryst. D52, 749–757.
Terwilliger, T. C. & Berendzen, J. (1997). Acta Cryst. D53, 571–579.
Terwilliger, T. C. & Berendzen, J. (1999). Acta Cryst. D55, 501–505.
Terwilliger, T. C. & Eisenberg, D. (1983). Acta Cryst. A39, 813–817.
Terwilliger, T. C. & Eisenberg, D. (1987). Acta Cryst. A43, 6–13.
Terwilliger, T. C., Kim, S.-H. & Eisenberg, D. (1987). Acta Cryst. A43, 1–5.
Terwilliger, T. C., Waldo, G., Peat, T. S., Newman, J. M., Chu, K. & Berendzen, J. (1998). Protein Sci. 7, 1851–1856.
Vagin, A. & Teplyakov, A. (1998). Acta Cryst. D54, 400–402.
Wilson, A. J. C. (1942). Nature (London), 150, 152.
Zou, J. Y. & Jones, T. A. (1996). Acta Cryst. D52, 833–841.
This is an open-access article distributed under the terms of the Creative Commons Attribution (CC-BY) Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original authors and source are cited.