research papers
Lowresolution phase extension using wavelet analysis
^{a}Department of Physics, York University, Heslington, York, YO10 5DD, England, and ^{b}Department of Chemistry, York University, Heslington, York, YO10 5DD, England
^{*}Correspondence email: julie@ysbl.york.ac.uk
A method to extend lowresolution phases is presented which uses histogram matching not only of the electron density, but also of histograms obtained from the different levels of detail provided by the wavelet transform of the electron density. Statistical values for the wavelet coefficients can be predicted and depend only on the resolution and solvent content. Therefore, new details can be added to an electrondensity map by matching the values of the wavelet coefficients to those predicted for an increased resolution. The positions of the new details are also guided by the diffraction pattern. In this way, the resolution can be increased gradually; on a number of trial structures of different size, solvent percentage and
it has been possible to extend the phasing from 10 Å to around 6–7 Å.Keywords: phase extension; wavelet analysis.
1. Introduction
An electrondensity map which can be interpreted in terms of an atomic model requires both the phases and amplitudes of the structure factors. However, only the diffraction amplitudes can be measured experimentally, giving rise to the socalled `phase problem' in crystallography. The standard methods of multiple ab initio determination of macromolecular crystal structures using Xray diffraction data from a single crystal may be divided into the following three stages.
(MIR) and multiplewavelength (MAD) can provide the required phases, although isomorphous derivatives or anomalous scatterers are not always easily obtained. Similarly, there is often no suitable starting model for structure solution by and a phasing scheme is required which does not depend upon any special circumstances or conditions. A method for the complete

2. Wavelet analysis
Suppose we have the onedimensional electron density ρ(x), shown in Fig. 1(a). Then, as we are only dealing with the values at grid points, we lose nothing by assuming that ρ(x) is constant between the grid points, as in Fig. 1(b).
Define the function φ(x), shown in Fig. 2, by
Then, since φ(x − k) is only nonzero for k ≤ x < k + 1, we can write ρ(x) as a linear combination of these functions, i.e. we have
where n is the number of grid points and A_{0,k} is just the value of ρ at the kth grid point. If we now scale the function φ(x) by a factor of 2, then we can only obtain the approximation to the electron density given by
as shown in the lefthand side of Fig. 3.
However, the differences between the approximation in (2) and the exact representation in (1) can be expressed in terms of the function ψ(x), shown in Fig. 4, defined by
so that
Thus, if we add the two parts of Fig. 3, we again have an exact representation of our electron density (up to the sampling precision). We can now repeat the process on the `smoothed' density in Fig. 3 to obtain an even smoother version in terms of the functions φ(x/4 – k). Now, if we add two different levels of details we can again recreate our original density exactly, i.e. we have
The function ψ(x) is the simplest wavelet function, known as the Haar wavelet. The function φ(x) is known as the Haar scaling function as it satisfies the equation
(with c_{0} = c_{1} = 1), which tells us the relationship between the functions on one scale and those which are twice as wide. This may seem too obvious to express in this way, but the scaling equation in (5) holds for more complicated wavelet systems. In all cases, we represent our density as a linear combination of some waveletscaling function φ and its integer translates as in (1). The scaling equation (5) then allows approximations of the original density by functions scaled by a factor of 2. At each level, the differences between the approximations are stored in terms of wavelet functions ψ, related to φ through
In the usual Fourier representation, we express the electron density in terms of sines and cosines and each one contributes to every point in the map. In a wavelet representation, the functions φ and ψ are compactly supported; that is, each function is nonzero for a finite number of grid points. For example, the Haar scaling function is nonzero for a single grid point at the original scale. The number of coefficients c_{j} in (5) is twice the number of nonzero grid points at the original scale. (The equation changes the scale by a factor of 2.) A set of wavelet functions is completely characterized by these coefficients, called filter coefficients to distinguish them from the wavelet coefficients A_{i,k} and B_{i,k} above. In fact, certain families of wavelet functions devised by Daubechies (1992) are referred to by the number of these coefficients. Daubechies' wavelets cannot be expressed as analytical formulae but are instead formed by an iterative procedure via (5). Daubechies' sixcoefficient wavelet function, used in the phaseextension method, is shown in Fig. 6.
We have described the multiresolution approach to wavelet analysis of Mallat (1989). In this case, the wavelet functions are specically constructed to give an exact representation at any level of the transform in terms of the functions φ and ψ associated with that level. Wavelettype functions have also been considered in the crystallographic context by Lunin (2000) to provide an approximation to the electrondensity distribution requiring few parameters.
3. Histogram matching
Histogram matching is a standard technique in image processing and has been used with success in Xray crystallography. A histogram of density values can be calculated from any discrete image and compared with the histogram expected from a `good' image. In the ab initio determination of molecular envelopes, this is used as a selection criteria in choosing phase sets from a large population of random starting phase sets (Lunin & Skovoroda, 1991). Furthermore, an image may be improved by systematically changing the density values to match the expected histogram; electrondensity histogram matching plays an important role in density modification at high resolution (Zhang & Main, 1990). The technique depends on the availability of expected histograms and it has been shown that these can be predicted for unknown protein structures. At high resolution, the protein and solvent regions are easily identified and the histogram matching is only performed over the protein region. In this case the histograms depend only on resolution (Zhang, 1993). At very low resolution, the electron density from the entire must be used and molecular packing as well as the percentage of solvent play a part in determining the histogram. However, it has been found that only considering the reduces the dependence on packing arrangements sufficiently and only resolution and solvent content need to be addressed (Main, 1998).
Mathematical models can be used to describe the histograms and Main (1990) provided a formula for calculating electrondensity histograms at high resolution. Similarly, Lunin & Skovoroda (1991) suggested a twocomponent histogram model, calculated empirically to correspond with the histograms of known protein structures. At low resolution, the electron density from the entire cell must be included in the histogram, giving a as shown in Fig. 7(a). Fig. 7(b) shows that the histograms can be described as a sum of two Gaussian functions, one which corresponds roughly to the solvent and one to the protein region.
The different levels of detail provided by the wavelet transform provides additional constraints on the electron density and allows histogram matching to be exploited further. For each of the x, y and z directions, two levels of a wavelet transform are performed, giving sets of wavelet coefficients A_{2,k}, B_{2,k} and B_{1,k} as in (4). The A_{2,k} are the coefficients of the scaling functions and relate to the smooth version of the electron density. It is not surprising then that the histograms calculated from these coefficients look like lowresolution electrondensity histograms. The histograms calculated from the detail coefficients B_{2,k} and B_{1,k}, however, look very similar to detail histograms obtained by Mallat (1989) for twodimensional images. Mallat showed that these histograms can be described by the mathematical model
The parameters α and β can be determined from the first and second moments of the histogram (see Wilson & Main, 2000). Fig. 8 shows typical examples of detail histograms together with histograms calculated from Mallat's model; the histogram labelled D_{1} is calculated from the coefficients B_{1,k} and the histogram labelled D_{2} from the coefficients B_{2,k}. As the parameters depend only on resolution and percentage of solvent, we have been able to predict values for α and β as functions of resolution for different solvent contents. These histograms have been shown to work as efficiently as histograms calculated using the actual values of the coefficients.
4. Extending lowresolution phases
To form a starting point for the method, we have assumed that phases to 10 Å are available; an electrondensity map is calculated with these. A twolevel wavelet transform is then performed along the x direction to obtain sets of wavelet coefficients A_{2,k}, B_{2,k} and B_{1,k}. The detail coefficients are then matched to values predicted for a slightly higher resolution, but the coefficients of the scaling functions A_{2,k} are left unchanged. These coefficients provide the smoothed electron density to which the details are added and it has been found that not matching these coefficients allows them to keep their position and stabilizes the procedure. The inverse wavelet transform is then performed using the new coefficients and the electron density obtained is matched to the histogram predicted for it at the new resolution. As the wavelet transform is applied to the entire cell, the symmetry is destroyed and this is now reimposed and the difference between symmetryrelated pixels is used to check the correctness of the map. The process is repeated until this difference is within preset limits. Wavelet transforms are also performed along the y and z directions (on the electrondensity map from the start of the cycle) in the same manner to obtain three maps at the new resolution to be averaged. The Fourier magnitudes calculated from this averaged map are then compared with the observed magnitudes and an R factor is computed. The calculated magnitudes are replaced by the observed magnitudes and a new map calculated on which the whole process is repeated until the R factor shows convergence. This completes a single cycle of the phase extension as shown in the flowchart of Fig. 9. At this point, the resolution is increased and the procedure is repeated on the electron density calculated using the original 10 Å phases together with the new phases obtained in the previous cycle.
The twoGaussian model used to describe the electrondensity histograms allows systematic changes and improves the results. Decreasing the variance of the solvent distribution effectively leads to solvent flattening without the need to predesignate an area as solvent. Similarly, it has been found that sharpening the electron density in the protein region can be achieved by increasing the variance associated with the protein distribution.
5. Results
The method has been shown to work on a large number of model structures varying in size, solvent content and shows the cumulative phase errors for a variety of proteins selected from the Protein Data Bank (Berman et al., 2000) for which structure factors were calculated from the atomic coordinates. There is a gradual buildup of phase errors as the calculation proceeds, but when calculated phases to 10 Å resolution are used we are currently able to extend these to a resolution of around 6–7 Å. In most cases, the 10 Å electrondensity map is little more than a mask roughly covering the molecule (Fig. 10a), whereas in a 6 Å map it is often possible to identify secondary structure. Fig. 10(b) shows the 6 Å map after phase extension from 10 Å, where the mean error on the new phases is 73°. This can be compared with Fig. 10(c) which shows the 6 Å map calculated from the atomic coordinates.
Table 1
‡Number of independent new reflections, 10–6 Å. 
The method has also been used with experimentally measured magnitudes and we have found that the method works equally well if the data are complete. Fig. 11 compares the results from calculated and experimental magnitudes and it can be seen that missing lowresolution data causes major problems. If, however, the experimental magnitudes are used where available and the missing data are replaced by appropriately scaled calculated magnitudes, the results are comparable to those obtained using only calculated magnitudes. All the graphs in Fig. 11 show the results after phase extension where the 10 Å starting phases were calculated from the atomic coordinates. However, the method has also been shown to work when the starting phases have errors. A polyalanine model was created from an αsubunit of human haemoglobin and positioned in the myoglobin using the molecularreplacement program AMoRe (Navaza, 1990). The 10 Å starting phases calculated from this model have a mean phase error of 46° with respect to those calculated from the atomic coordinates of myoglobin. Again, the experimental magnitudes were used where available and calculated magnitudes were used for the missing reflections. Fig. 12 shows the mean phase errors for the new phases in comparison with those obtained from perfect starting phases (i.e. those calculated from the final model).
6. Phase refinement
As well as constraining the density by histogram matching and wavelet analysis, we are experimenting with applying constraints to the structurefactor phases. Some constraints are more easily applied in real space than vice versa. We intend to exploit both. The assumption that the structure consists of randomly positioned atoms leads to the phase probability distribution
andwhere
and
with N equal atoms in the This formula is the basis of the highly successful of determination, which are successful at least when N is not too large and atomic resolution data are available.
For most proteins at less than atomic resolution, the formula gives little useful information. However, if more is known about the structure, it is possible to obtain new phase probability formulae that could be useful. For example, if the molecular envelope is known, the atoms can be assumed to be randomly distributed within the envelope instead of over the complete cell. This leads to the phase probability formula
where
with (h) { = (h)exp[iΦ(h)]} being the Fourier coefficients of the volume containing the random atom distribution normalized to (0) = 1. Note that (8) indicates a most probable value for φ(h, k) of Φ(h, k), which is given by the phases of the appropriate s. A combination of phase distributions gives a modified tangent formula which can express one phase in terms of a large number of others,
where the weight w(h, k) takes account of the expected errors in the phases φ(h − k) and φ(k). With a small enough volume for the random atoms, this formula can give useful phase information for proteins, even at low resolution. Preliminary tests have achieved a phase improvement at 7 Å resolution from 62 to 53° for 480 strong reflections, using only a knowledge of the molecular envelope at 10 Å resolution.
If part of the secondary structure has been recognized from previous maps or a model for part of the structure is available, this can be included in a phase probability formula very similar to (8), but with
where
and the same notation as given for (7) is used. There are m atoms in known positions, n randomly distributed (unknown) atoms and a total of N atoms in the cell. All atoms are assumed to be equal.
7. Discussion
In order to test the method when the starting phases have errors, we have used models which were suitable for αsubunit of haemoglobin was used to provide phases for myoglobin. Although the phase error is as high as 46°, the models used are of sufficient quality for structure solution by the normal molecularreplacement method. However, since we only require phases to 10 Å, the method could be used in cases where the model is much poorer and the usual procedure fails. In particular, starting phases may be obtained from electronmicroscopy images and work is in progress on an as yet unsolved protein for which 10 Å phases have been obtained by of such an image.
for example, theBy 6–7 Å resolution, the buildup of phase errors is becoming too great and further information needs to be added to extend the phases further. At this stage, it is often possible to recognize secondary structure in the maps and it is hoped that this information can be recycled and the power of the method extended. Furthermore, the method should be improved significantly if an effective weighting scheme can be found, as none is used at present. The formulae given in the last section are currently being tested, but no results are yet available for publication. In addition, a formula has been developed which includes both known atomic positions and a restricted volume for the unknown atoms. It is intended that phase
using the above formulae will become part of the overall phasedetermination procedure described in this paper.Acknowledgements
Julie Wilson is grateful to the BBSRC for the support afforded by the research grant 87/SBD7592.
References
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H, Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235–242. Web of Science CrossRef PubMed CAS Google Scholar
Cowtan, K. D. & Main, P. (1993). Acta Cryst. D49, 148–157. CrossRef CAS Web of Science IUCr Journals Google Scholar
Daubechies, I. (1992). Ten Lectures on Wavelets. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. Google Scholar
Lunin, V. Y. (2000). Acta Cryst. A56, 73–84. Web of Science CrossRef CAS IUCr Journals Google Scholar
Lunin, V. Y. & Skovoroda, T. P. (1991). Acta Cryst. A47, 45–52. CrossRef CAS Web of Science IUCr Journals Google Scholar
Main, P. (1990). Acta Cryst. A46, 507–509. CrossRef CAS IUCr Journals Google Scholar
Main, P. (1998). Unpublished results. Google Scholar
Mallat, S. (1989). IEEE Trans. Pattern Anal. Mach. Intell. 315, 674–693. CrossRef Web of Science Google Scholar
Navaza, J. (1990). Acta Cryst. A46, 619–620. CrossRef Web of Science IUCr Journals Google Scholar
Wilson, J. & Main, P. (2000). Acta Cryst. D56, 625–633. Web of Science CrossRef CAS IUCr Journals Google Scholar
Zhang, K. (1993). PhD thesis. York University, England. Google Scholar
Zhang, K. & Main, P. (1990). Acta Cryst. A46, 41–46. CrossRef CAS IUCr Journals Google Scholar
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited. For more information, click here.