PRODRG: a tool for high-throughput crystallography of protein-ligand complexes.

The small-molecule topology generator PRODRG is described, which takes input from existing coordinates or various two-dimensional formats and automatically generates coordinates and molecular topologies suitable for X-ray refinement of protein-ligand complexes. Test results are described for automatic generation of topologies followed by energy minimization for a subset of compounds from the Cambridge Structural Database, which shows that, within the limits of the empirical GROMOS87 force field used, structures with good geometries are generated. X-ray refinement in X-PLOR/CNS, REFMAC and SHELX using PRODRG-generated topologies produces results comparable to refinement with topologies from the standard libraries. However, tests with distorted starting coordinates show that PRODRG topologies perform better, both in terms of ligand geometry and of crystallographic R factors.


Introduction
With the rise of structure-based drug-design techniques (reviewed in Davis et al., 2003), it is important to have software available which supports the ligand/inhibitor throughout the entire design process. Firstly, coordinates for the drug need to be built or an existing molecule modi®ed, followed by docking of the drug into the active site and/or re®nement of a protein± drug complex against X-ray diffraction data. The protein±drug interaction then needs to be examined in terms of detailed hydrogen-bonding geometry or other scoring functions (reviewed in Brooijmans & Kuntz, 2003). During this process, the drug interacts with different types of software and for each of these types a wide variety of packages are available (Davis et al., 2003). Making these computer programs understand the topology of the drug involved is often a laborious process and, when no structural information is available, prone to errors as bond lengths and angles often have to be guessed . In the current drive towards high-throughput crystallography, a large number of protein±inhibitor complexes need to be re®ned and evaluated, which increases the need for a high level of automation (Blundell et al., 2002). Similarly, signi®cant effort is currently being invested into virtual screening of small-molecule libraries using docking methods (Richards, 2002). To be able to create, dock and re®ne large libraries of small molecules, a fast, accurate and publicly available program is needed to create topological information from a variety of input formats (two-dimensional and threedimensional representations) for a wide range of computer packages used in this process. Here, a new version of the program PRODRG is described which performs all these tasks. The program is tested against the Cambridge Structural Database (CSD; Allen, 2002) and a number of protein±ligand complexes.

Details of the PRODRG algorithm 2.1. PRODRG basics
The basics of the PRODRG algorithm have been described previously (van Aalten et al., 1996); hence only a short overview will be given here. The main aim of PRODRG is to provide topological information for small molecules that can be used in X-ray re®nement, molecular-dynamics simulations, molecular modelling and docking studies. PRODRG is currently limited to molecules containing H, C, N, O, P, S, F, Cl, Br or I atoms; also, atoms with more than four bonds and certain types of bonds between halogens and non-C atoms are not supported.
Previously, PRODRG only accepted coordinates in PDB format (PDB mode) as input (van Aalten et al., 1996). This has now been expanded, with two additional input modes. The ®rst allows description of molecules as a simple ASCII drawing (TXT mode), illustrated in Fig. 1. The TXT mode represents a portable description of the molecule (it can be created and edited in any text editor in any operating system) that is easily interpreted by humans as well as machines. Single, double and triple bonds can be drawn between atoms and chirality (discussed below) is indicated by the case of the letter describing the atom. The second new mode is the popular MDL Mol®le/SD®le format (MOL mode), which is used in programs such as ChemDraw (CambridgeSoft, Massachusetts, USA) and ISIS/Draw (MDL Information Systems, California, USA) and is also written out by the Java-based JME editor (Ertl & Jacob, 1997).
The net result after initial processing by PRODRG is a connection table, containing the bonds between non-H atoms, the hybridization states and information on chirality (see van Aalten et al., 1996 for a full description). All further information, such as all coordinates and the H atoms in the input, are ignored. This has the advantage of PRODRG entering the subsequent steps with the same information regardless of what this was determined from: a small molecule input via TXT mode will thus lead to the same topology and derived information as the same molecule supplied via a high-resolution crystal structure.

Determination of protonation state
After the initial connection table has been generated, probable amide N atoms are identi®ed and the presence/ extent of aromatic systems is determined. The aromaticity detection is based on Hu È ckel's 4n + 2 rule, but is not limited to single-ring systems. With this information it is then possible to add H atoms so that the expected valencies are satis®ed, even though in some cases the program will add fewer or more H atoms, so that e.g. carboxylates remain deprotonated while guanidinium groups are fully protonated. PRODRG offers three statements for modifying the input or generated structure. Two of them, sxrh`tomb and hivrh`tomb allow modi®cation of the protonation state of any atom by either adding or removing a hydrogen to/from it (Fig. 2). The third command, egr`tomb`vlueb, is used to force the hybridization of an atom (vlue = 1, 2, 3 for sp, sp 2 or sp 3 hybridization) or to invert a chiral centre (vlue = À1). It thus provides an easy tool to modify an existing structure on the¯y, but the ability to modify hybridization assignments is also useful in case PRODRG misinterprets poor input coordinates. The TXT input mode. (a) 3,7-Dimethyl-1-propargylxanthine as an example for TXT input. Single bonds can be input as ± or |, double bonds as = or ª and triple bonds as #. Atoms must be separated by bonds, while bonds/atoms that are not connected must be separated by white space;`diagonal' connections are not accepted. H atoms may be included but will be ignored. (b) Common mistakes when entering TXT drawings. Left (ethanol) from top to bottom: correct drawing; useless inclusion of H atoms; missing bonds. Right (ethylene oxide) from top to bottom: correct drawing; no space between O and the CÐC bond; diagonal connection to O. (c) The chirality of atoms can be changed by using lower-case element symbols.

Coordinate generation and energy minimization
The H-atom assignment is followed by the generation of a topology for use with GROMACS (Berendsen et al., 1995;Lindahl et al., 2001). If desired, PRODRG can then use GROMACS to either generate coordinates ab initio for the molecule or energy-minimize user-provided coordinates. Energy minimization is performed by steepest descent for at most 50 000 steps, with the ffgmx GROMACS force ®eld, extended by 11 additional atom types to accommodate halogens, sp-hybridized atoms and other chemical features. Parameters for the new atom types have been determined from about 47 000 experimentally determined small-molecule structures from the CSD (see below).

Testing on compounds in the CSD
A set of compounds was selected from the CSD to perform a large-scale test of PRODRG topology quality. Compounds were selected if they did not contain atoms other than C, H, N, O, P, S, F, Cl, Br and I. In the case of entries containing multiple molecules, the largest molecule was chosen. This resulted in 46 964 compounds which were processed by PRODRG in less than 11 h on an 2.0 GHz AMD Athlonbased Linux system. For each compound, the full topological research papers Table 1 Statistics for the PRODRG run on $47 000 small-molecule X-ray structures selected from the CSD.
Failure owing to PRODRG limitations includes structures containing atoms with more than four connections, unsupported non-carbon±halogen bonds and molecules consisting of fewer than three atoms. Structures are considered too complex if repeated attempts at energy minimization fail to yield results of acceptable geometry in terms of the ffgmx GROMACS force ®eld.`Bad input geometry' summarizes structures of unusual geometry, the interpretation of which led to unresolvable inconsistencies, forcing PRODRG to fail.

Figure 2
Use of sxrh and hivrh to generate different protonation states of histidine. For some simple molecules PRODRG will automatically generate meaningful/standard atom names, which in this case allows the two N atoms of the imidazole ring to be addressed as ND1 and NE2.
information was generated, followed by energy minimization with the generated topology in the GROMACS package. Of the 46 964 PRODRG runs, 820 failed for the reasons described in Table 1. The 46 144 successfully processed structures were then compared with the starting structures in terms of bond lengths, bond angles, improper dihedral angles and coordinate r.m.s.d. (Fig. 3). The average r.m.s.d.s between crystallographic and PRODRG-generated structures are 0.040 A Ê on bonds, 2.99 on angles, 1.97 on improper dihedrals and 0.26 A Ê on aligned coordinates. These reasonable results re¯ect both PRODRG's ability to extract topological information from coordinates only and the quality of the GROMOS87 force-®eld-based limited parametrization used.
There are numerous other programs that generate threedimensional coordinates from connection-table data (reviewed in Sadowski et al., 1994, andupdated in Gasteiger et al., 1996). The aim of these programs is to predict accurately the`real' conformation of a compound for use in e.g. 3D-QSAR (quantitative structure±activity relationship) studies. PRODRG-generated structures, on the other hand, while generally of low energy and chemically meaningful, are neither guaranteed nor intended to represent the absolute energy minimum of an input compound. This is not necessary, as PRODRG-produced structures will normally be used as the starting point for other procedures such as model building, crystallographic re®nement, molecular dynamics or docking, which will determine the ®nal conformation.
3.2. Testing in X-ray refinement PRODRG writes out topology information which can be used in X-PLOR/CNS, REFMAC5 or SHELX to properly model small-molecule compounds during re®nement against X-ray crystallographic data. The quality of the automatically generated topologies was evaluated using a number of re®ned structures, in which the previously used small-molecule topology was substited with a PRODRG topology generated from a TXT drawing (Figs. 4a and 4b). Re®nement was then continued and initial and ®nal R factors compared, together with an indication of conformational change in the small molecule introduced by switching the topology, expressed as the r.m.s.d. on the atomic positions. In the PRODRG-research papers generated X-PLOR/CNS topologies, the bonded forces are scalable with a separate weight factor and values of 0.25, 0.5, 1.0, 2.0 and 4.0 were tested for all systems to obtain an optimum weight of the geometrical restraints versus X-ray data for the small molecule in terms of the smallest separation between R and R free . The results are presented in Table 2 and Figs. 4(a) and 4(b), showing that PRODRG topologies perform well in crystallographic re®nement.

Comparison with similar programs
3.3.1. XPLO2D/HIC-Up. The Uppsala Software Factory program XPLO2D (Kleywegt, 1995) can be used to generate topologies for use with, amongst others, X-PLOR/CNS and O from small-molecule coordinates. For small molecules present in PDB entries, the HIC-Up service (Kleywegt & Jones, 1998) provides the required coordinates (gathered from the PDB) as well as pregenerated XPLO2D topologies. Unlike PRODRG, which always uses its own GROMOS87-derived parameters, XPLO2D derives topology parameters from the input coordinates, thus implicitly assuming these are correct .
To compare the performance of XPLO2D-and PRODRGgenerated topologies for re®nement with CNS, several highresolution structures ( 1.2 A Ê ) were obtained from the PDB and re-re®ned after truncating the data to 2.8 A Ê resolution, optionally after slight perturbation (by an average random coordinate shift of 0.1 A Ê ), with topologies produced from the original ligand coordinates either by PRODRG or XPLO2D.
In all cases, the crystallographic weight was optimized to give the lowest R free . Table 3 shows that the coordinate r.m.s.d.s between the original high-resolution ligand(s) and the re-re®ned ligand(s) do not differ signi®cantly between the two topology sources. This is remarkable considering that XPLO2D, unlike PRODRG, acquires its parameters from thè perfect' input structure and thus its topologies might be expected to present a better model of this perfect structure. The values of R work as well as the real-space R factor computed with O are generally similar for PRODRG-and XPLO2D-based re®nement runs; on the other hand, R free is consistently lower when using PRODRG-generated topologies. The r.m.s.d.s for the runs with perturbed or unperturbed coordinates are essentially identical in all cases, showing that the quality of the results is not signi®cantly in¯uenced by either topology being`too loose'.
Next, the impact of the quality of the input coordinates was investigated. The re®nement of HGPRT (PDB code 1fsg) was repeated several times with XPLO2D-and PRODRGgenerated topologies produced from ligand coordinates to which an increasing random coordinate shift (from 0.05 to 0.25 A Ê ) had been applied (Fig. 4e). As expected, the XPLO2D-dependent re®nement deteriorates steadily with increasing ligand coordinate error. Because PRODRG uses tabled parameters, its topologies are less sensitive to the quality of the input coordinates, even though above an average shift of 0.15 A Ê atom-type misassignments begin to occur (intriguingly though in this case these lead to a minimal improvement in the re®ned ligand geometry). For comparison the results obtained with topologies generated independently of input coordinates are also shown (empty diamonds in Fig. 4e). In PRODRG, topologies produced from twodimensional descriptions can be expected to perform equally well or better than those derived from PDB input, as the drawings allow greater precision in the speci®cation of a compound. Indeed, in the test case the TXT-produced  Table 2 Details of X-ray re®nement tests of protein±ligand complexes using PRODRG topologies.
All measured data were included in the re®nement. The source of the original topology is indicated (S, standard library of the re®nement program; M, manually made topology; L, topology made with LIBCHECK and validated manually). The additional re®nement consisted of two cycles of 100 steps of positional re®nement followed by 20 steps of temperature-factor re®nement (CNS) or ten steps (REFMAC5). The real-space R factor was calculated using O with standard settings. ChiB, Serratia marcescens chitinase B (van Aalten et al., 2001); SCP-2L, sterol carrier protein type 2-like domain of human multifunctional enzyme type 2 (Haapalainen et al., 2001); ACBP, acyl-CoA binding protein (not published); PTR1, Leishmania major pteridine reductase 1 (Gourley et al., 2001;Schu È ttelkopf, 2003); PYP, Ectothiorhodospira halophila photoactive yellow protein (van Aalten et al., 2002); n/a, not applicable; n/d, not deposited.  Use of PRODRG-generated topologies. (a) GlcNAc 5 in ChiB (van Aalten et al., 2001). Left, stereo diagram of the ligand molecule before (cyan) and after (green) re®nement with a PRODRG-generated topology. The surrounding protein is shown as a semitransparent cartoon. Right, text drawing used to generate the topology. (b) As (a) for Triton X-100 in SCP-2L (Haapalainen et al., 2001). (c) Ligand from a high-resolution structure (cyan molecule) of human neutrophil collagenase (Gavuzzo et al., 2000) re-re®ned at lower resolution with topologies generated either with PRODRG (green molecule) or with LIBCHECK (orange molecule). Again, the protein is shown as a semitransparent cartoon. To the right, the chemical structure of the ligand [2-(biphenyl-4-sulfonyl)-1,2,3,4-tetrahydroisoquinoline-3-carboxylic acid] is given. (d) As (c) for (3-amino-2,5-dioxo-1-pyrrolidinyl)-acetic acid in Cryphonectria parasitica endothiapepsin (Erskine et al., 2003). (e, f) Effect of poor input geometries on the quality of generated topologies as indicated by the r.m.s.d. between small-molecule coordinates from the`ideal' starting structure and the same structure after re®nement at lower resolution. The re®nement of HGPRT as described in Table 3 is repeated with topologies generated from coordinates perturbed by a given random shift (®lled squares). In addition, the corresponding re®nement results using topologies produced in a coordinate-independent manner are given (empty diamonds). For PRODRG this means topologies were generated from TXT-mode drawings; for XPLO2D the topologies available from HIC-Up were used and for LIBCHECK the ligands were drawn in SKETCHER. Weights are kept at the values given in Table 3. (e) shows the results for re®nement with CNS and (f) for REFMAC5.
In the test case we obtain the most favourable results possible in terms of coordinate r.m.s.d., as the HIC-Up versions of both ligands used come from structure 1fsg and thus are identical to the`ideal' structures. 3.3.2. REFMAC5/LIBCHECK. REFMAC5 comes with a library containing topologies and parameters for several common small molecules and topologies only (`minimal descriptions') for a large number of additional molecules (Murshudov et al., 1997). Upon encountering a small molecule for which no or only a minimal description is available, REFMAC5 (using the associated program LIBCHECK; Murshudov et al., 1997) Table 3 Low-resolution (2.8 A Ê ) re-re®nement of high-resolution structures.
The differences in performance between PRODRG-and LIBCHECK-produced topologies are small in four of the six test cases. In the remaining two cases [HNC (Gavuzzo et al., 2000) and EAPA (Erskine et al., 2003)] the re®nement using PRODRG topologies gives signi®cantly better ligand conformations, with r.m.s.d. LIBCHECK /r.m.s.d. PRODRG ! 1.5. A closer look shows that in the case of HNC the large conformational difference introduced by re®nement with the LIBCHECK topology is a consequence of an inappropriate planarity restraint covering the entire biphenyl moiety of the ligand, even though in the high-resolution structure the two phenyl rings are, as would be expected, at an angle of $22 (Fig. 4c); in EAPA the geometry of a residue representing a cyclized Asp-Gly dipeptide is somewhat distorted by the LIBCHECKgenerated topology owing to two atom-type misassignments: C2 H and C3 are incorrectly typed as sp 2 -hybridized, which results in bond lengths that are too short (Fig. 4d). It should be pointed out that neither of the two poorly performing compounds exist in the REFMAC5-distributed library and thus LIBCHECK had to generate the topologies without the help of a minimal description.
The relative performance of LIBCHECK and PRODRG with lower-quality ligand coordinates was again assessed for the case of HGPRT; the results are shown in Fig. 4(f). As pointed out above, the PRODRG-generated topologies show some deterioration above a random coordinate shift of 0.15 A Ê , which can be avoided by instead de®ning the ligands through TXT drawings or other two-dimensional descriptions. LIBCHECK performs similarly to PRODRG in this case, even though its atom-type detection seems to be more sensitive to coordinate error. Like PRODRG, LIBCHECK allows the production of topologies in a truly coordinate-independent fashion by drawing them interactively in the SKETCHER program. While this obviates the need for high-quality ligand coordinates, the GUI-based procedure is relatively tedious and incompatible with high-throughput approaches.

Current limitations
The dependence of PRODRG on GROMACS (the GROMOS87 force ®eld) leads to a number of limitations in the scope of compounds that PRODRG can handle. The most notable restriction is the comparatively small number of elements that the program supports. While the current selection (H, C, N, O, P, S, F, Cl, Br, I) allows processing of a wide range of biomolecules and potential drugs, further elements covering at least B, As, Se and common metal cations such as Fe 2+/3+ or Mg 2+ would greatly extend this range. The other force-®eld-related problem is the limited number of atom types available for the supported elements, occasionally leading to a poor representation of phosphorus/sulfur chemistry and of sp-hybridized atoms. While many of these issues have been addressed in the current version of PRODRG, further improvements could be achieved with the addition and parametrization of more atom types.
Further limitations include the inability to detect certain aromatic systems such as pyrene, which possess 4n %-electrons. Also, PRODRG currently does not store information on bond types provided in Mol®les or text drawings: all computation is based solely on the hybridization state of individual atoms. Keeping bond-type data would be helpful in resolving certain ambiguities, e.g. in hydrogen placement.

Conclusions
PRODRG provides fast, automated and, within the given limitations, reliable access to small-molecule topologies and coordinates for use with high-throughput protein±ligand crystallography. Tests in crystallographic re®nement show that PRODRG-generated topologies are generally of equal quality or better than topologies obtained by other means. PRODRG obviates the requirement for high-quality input coordinates or other additional data in generating topologies, as it can operate even on two-dimensional representations of a molecule, such as the industrial standard MDL Mol®le/SD®le. It should also be noted that the variety of topologies generated by PRODRG allows the use of consistent descriptions of a given molecule in all steps of the inhibitor-design process, from crystallographic re®nement and visualization through structure analysis to molecular-dynamics or docking studies.
Additional extensions of PRODRG with applications in automated ligand design and optimization are currently being developed, as well as PRODRG-based algorithms for automated identi®cation and ®tting of small molecules in electrondensity maps. Development on the core PRODRG application aims to overcome the limitations in terms of atom types and force ®eld. A particular focus is the implementation of a new coordinate-generating mechanism which will remove the dependency on GROMACS from PRODRG, speed up coordinate production and, most importantly, open a path towards the use of different/novel force ®elds. This in turn will then allow support for additional atom types, thus extending the applicability of PRODRG.
Financial support by a Wellcome Trust Senior Fellowship and an EMBO Young Investigator Fellowship (to DvA) is gratefully acknowledged. We would like to thank Charlie Bond for valuable discussions and critical reading research papers of the manuscript. For academic research purposes PRODRG is freely available as a WWW service at http:// davapc1.bioch.dundee.ac.uk/prodrg/. Binaries for Linux, IRIX, FreeBSD or Windows are available upon request.