research papers
Differential evolution for protein crystallographic optimizations
^{a}ActiveSight and Molecular Images, 4045 Sorrento Valley Boulevard, San Diego, CA 92121, USA
^{*}Correspondence email: dmcree@activesight.com
Genetic algorithms are powerful optimizers that have been underutilized in protein crystallography, given that many crystallographic problems have characteristics that would benefit from these algorithms: nonlinearity, interdependent parameters and a complex function landscape. These functions have been implemented for realspace optimizations in a new fitting program, MIfit, for realspace of protein models and heavyatom searches. Some programming tips and examples will be presented here to aid others who might want to use genetic algorithms in their own work.
Keywords: differential evolution; genetic algorithms; optimization.
1. Introduction
Genetic algorithms (GA) include a broad class of optimizers that share several characteristics. A vector of the variables (analogous to genes in natural evolution) of the problem to be optimized is constructed and a randomized population of these vectors is created. The vectors are then scored by the means of a fitness function and the fitter vectors selected to form the next generation. The next generation is formed by mutation and crossover of the parent vectors and if the daughter thus formed is more fit than the parent then the parent is replaced. The whole process is repeated for several generations. After a number of generations the population as a whole becomes more fit (that is, it scores higher in the fitness function) and in a successful outcome individuals appear that will solve the problem. Generally, the best scoring individuals from the last generation are returned as solutions. Genetic algorithms can be differentiated from each other by the means by which they perform the mutation and crossover steps and by the manner in which the variables or `genes' are encoded. In experimenting with genetic algorithms, we have found that the choice of algorithm and the tuning of the algorithm parameters make large differences in the speed and success in solving the problem.
Several previous examples exist of the use of genetic algorithms in protein crystallography. Perhaps the best known and most widely used has been the program EPMR written by Kissinger et al. (1999). EPMR is a molecularreplacement program that optimizes the best values of three translational and three rotational parameters simultaneously using a GA. Our success with this program was responsible for our looking further into using GA for our problems. After much experimentation, we found that a particular GA method termed differential evolution (DE) was the most successful and flexible optimizer for our problems, which included rigidbody fitting in realspace and finding the optimum sidechain rotamers. We have implemented these in the program MIfit, which is available from Molecular Images (San Diego, CA, USA; http://www.molimage.com ).
2. Differential evolution
Here, we will discuss issues relevant to using DE in common crystallographic applications; a detailed description of DE, other example uses and code samples can be found in the existing literature (Price, 1999; Storm, 1999; Lampinen & Zelinka, 1999; Chisholm, 1999). In DE, the genes are floatingpoint numbers and this has advantages in scaling and over the more traditional bits or integers used in some other GAs. A common difficulty in GA optimization is the determination of the proper size of the mutation step, the amount by which a variable (or gene) is changed between generations. If the steps are too large the solution may be missed and if they are too small it will take many generations to converge. In DE, the size of the mutation step is determined by taking differences between two individuals in the population [and multiplying it by a mutation factor, F (Price, 1999), which is used as an overall control variable]. This method has the advantage of providing an adaptive scaling of the mutation as the population matures. In the early stages, the population is diverse and the mutation sizes can be large. If a variable contributes to the score produced by the fitness function, its values within the population will converge towards the best value and the variance in that variable will decrease. If the variable contributes very little or nothing to the score, its range of values will increase with time as mutations are made. This can either continue until values of the variable are found that make significant contributions, or when the maximum generation limit is reached the variable will have a large variance in the population with no dominant values. This information can be very valuable in assessing the final solution reached at the end of a run.
1.2. Restraints and constraints
Interestingly, the difference method of scaling means that a variable can be constrained to a value by constructing an initial population with no variance in that variable; that is, where all of the members of the population have the same value. In this case, the differences between individuals will always be zero and the mutation size will thus also be zero. Similarly, a variable can have integer values by seeding the population with only whole numbers; all of the differences between those variables will also be whole numbers. Many variables are bounded and can have values only in a specified range. In this case, a function can be supplied to impose these bounds between generations. Restraints, where a variable is restrained to be near a value with a given distribution, are very common in crystallographic versus positional values. Another strategy is to impose a direct selection process in between generations where the outliers would be pruned and replaced with new individuals. This has been likened to a predator–prey relationship in nature. A combination of both may produce the best results. In practice, this pruning cannot be too aggressive or the population will never be able to evolve. It will essentially be driven to extinction before it can get started! A good strategy is to add a pruning step every dozen generations or so.
of models. For example, bonds and angles are restrained to target values but allowed to vary, a scenario which is more difficult to set up in DE. One strategy is to include this expected variance as part of the fitness function. The best score would be obtained by an individual that has the best fit of positional values to the data and the best geometry. Weighting could also be incorporated into the fitness function by setting an overall weight for geometric restraints3. Random numbers and probability distributions
It is essential to use a randomnumber generator suited to the problem. This issue arises because the standard randomnumber generators included in C and Fortran are very fast but not necessarily well suited to DE. Better randomnumber generators are available in many code libraries. Two types of randomnumber generators are useful in different contexts: equally distributed and Gaussian distributed. Equally distributed suffices for many problems, but in many cases a Gaussian distribution in a variable is more desirable and may lead to faster convergence by forming a few outliers that may prove to be closer to the final value when the starting point is far from the final answer. The randomnumber generator could make use of prior probability information if it is known. For instance, with phase distributions, a range of phases that match the phase probability distribution would be used. Generating correct distributions is key to making the problem well conditioned.
4. Scoring
The fitness function is key to making a GA work. In some cases the scoring may be straightforward, comprising a simple evaluation of the current parameters of an equation. In other cases, the function could be quite complex and essentially form a program within the program. For instance, in one implementation of DE the best parameters for a checkersplaying program are evaluated via a pairwise tournament that is held between all of the members of the population and the fitness score is the number of times an individual wins (Chisholm, 1999). In nature, fitness is very involved and for humans includes an incredibly complex range of social interactions and toolmaking abilities that are eventually included in an individual's ability to reproduce. In nature, the environment an individual finds itself in is highly variable and is impactful in assessing the fitness of an individual, whereas in the computer the environment is the same for every individual.
5. Crossover
Crossovers are a very powerful component of GA optimizers which distinguish GA methods from other optimization techniques involving random fluctuations such as Monte Carlo or simulated annealing. In some GA algorithms, only crossover is used with mutations held to 0. In this case, all of the variation needed to solve the problem should exist in the initial population. The crossover functions used can vary from a simple swapping of variables to swapping of contiguous groups in an attempt to imitate gene linkage in biological recombination processes. In DE, a simple swapping of values between individuals is normally used. The effect of crossovers is to allow for `hillhopping' in complex functions that have a number of local maxima. A simple illustration of this shown in Fig. 1. The overall rate of crossovers are controlled by rolling a random number and comparing it with the crossover frequency, CR, which can be set to a value between 0 (no crossovers) and 1 (always crossover) (Price, 1999).
6. Programming example
This example is for optimizing a molecular fragment into an electrondensity map. The search has six parameters, three translations in x, y and z and three rotations about the axes of x, y and z. The goal of the optimization is to find the values of these parameters that give the highest correlation with the electron density. The scoring function is the sum of the density found at the atom centers weighted by the of the atom. If the resolution is low, densities above 2σ are truncated to prevent overweighting overlapped densities (i.e. the center of a ring such as in a phenylalanine residue). Negative densities are heavily penalized by multiplying them by a factor of five. The map density is interpolated by means of a spline function that has been shown to return values within 0.1% of the true value (Lampinen & Zelinka, 1999). A more accurate method for scoring would be to correlate the observed and calculated density. However, constructing the calculated density is a lengthy procedure involving looping over all the atoms over a sphere of density points and would slow the function considerably. Since this function is meant to be interactive in real time, the approximation has been found to be a better compromise between speed and accuracy.
To set the parameters for mutationstep size, the program first calculates the scoring function for the starting solution. If it is high, meaning that the model is a fair match to the density, then the mutationstep size is set to be smaller under the premise that the user only needs to optimize an already good solution, otherwise the mutationstep size is broadened so that a larger volume is searched for the solution. The translational variance target is set to 0.25 Å and the rotation target to 0.75°. The crossover frequency CR (Price, 1999) is set to 0.8 and the mutation factor F (Price, 1999) is set to 0.60. The population is set to 120 vectors. Each vector has six members, three translations and three rotations, and the function is run for 40 generations. These numbers were found by trial and error to be enough generations and a large enough population to lead to good convergence. In each generation, a trial daughter is constructed from each of the parent genes by crossover and mutation. If the daughter has a better score than the parent then the parent is replaced; otherwise, the parent is kept. After the 40 generations the best solution is applied to the model, the screen is updated and the score reported to the user. The results of a typical run are shown in Fig. 2 and Table 1. All this takes a fraction of a second for most fragments of a few residues and about a minute for an entire molecule of about 200 residues. The equivalent bruteforce algorithm is 24fold slower than the DE algorithm with only seven steps in each variable, which requires scoring 7^{6} or 117 649 trials, whereas the DE algorithm scores 686 trials. The final solution is also sampled on a finer scale for the DE algorithm since the last few generations take very small step sizes as virtually all of the population has converged to the same values. In comparing with a leastsquares procedure it must be remembered that the takes longer. However, many of the problems solved easily by DE would not be possible with least squares, which is only able to converge to the nearest local minimum of a target function. In cases where the starting position of the model does not overlap sufficiently with the density, no solution would be found. Thus, one of the chief advantages of the method is that it has a larger radius of convergence while remaining faster than a bruteforce search over all possibilities.

References
Chisholm, K. (1999). New Ideas in Optimization, edited by D. Corne, M. Dorigo & F. Glover, pp. 147–158. London: McGraw–Hill. Google Scholar
Kissinger, C. R., Gehlhaar, D. K. & Fogel, D. B. (1999). Acta Cryst. D55, 484–491. Web of Science CrossRef CAS IUCr Journals Google Scholar
Lampinen, J. & Zelinka, I. (1999). New Ideas in Optimization, edited by D. Corne, M. Dorigo & F. Glover, pp. 127–146. London: McGraw–Hill. Google Scholar
Price, K. V. (1999). New Ideas in Optimization, edited by D. Corne, M. Dorigo & F. Glover, pp. 79–108. London: McGraw–Hill. Google Scholar
Storm, R. (1999). New Ideas in Optimization, edited by D. Corne, M. Dorigo & F. Glover, pp. 109–125. London: McGraw–Hill. Google Scholar
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited. For more information, click here.