research papers
SPINE workshop on automated X-ray analysis: a progress report
aThe Division of Structural Biology and Oxford Protein Production Facility, The Wellcome Trust Centre for Human Genetics, Roosevelt Drive, Oxford OX3 7BN, England, bCCLRC, Daresbury Laboratory, Warrington WA4 4AD, England, cNetherlands Cancer Institute, Molecular Carcinogenesis Department, Plesmanlaan 121, 1066 CX Amsterdam, The Netherlands, dYork Structural Biology Laboratory, Department of Chemistry, University of York, Heslington, York YO10 5YW, England, and eEMBL Hamburg, Building 25A, DESY, Notkestrasse 85, 22603 Hamburg, Germany
*Correspondence e-mail: keith@ysbl.york.ac.uk
The Structural Proteomics In Europe (SPINE) consortium contained a workpackage to address the automated X-ray analysis of macromolecules. The aim of this workpackage was to increase the throughput of three-dimensional structures while maintaining the high quality of conventional analyses. SPINE was able to bring together developers of software with users from the partner laboratories. Here, the results of a workshop organized by the consortium to evaluate software developed in the member laboratories against a set of bacterial targets are described. The major emphasis was on molecular-replacement suites, where automation was most advanced. Data processing and analysis, use of experimental phases and model construction were also addressed, albeit at a lower level.
Keywords: automation; molecular replacement; data quality; model-building tools.
1. Introduction
The Structural Proteomics In Europe (SPINE) project, initiated in 2002, aimed to introduce new technologies and approaches to the complete set of processes required to determine the three-dimensional structures of biomedically relevant proteins. It was envisaged that the majority of these structures would be determined using X-ray crystallography and a distinct section of the programme was devoted to this method. The stated aims of this work package were
to address the problems of automated X-ray analysis of macromolecules. To achieve throughput in keeping with genome sequencing projects, macromolecular crystallography (MX) procedures must be streamlined, and work in a number of laboratories in Europe, including several SPINE partners, is directly addressing this. Scripting will link the various stages, and better algorithms will be formulated in key areas such as
(MR), experimental phasing, automated generation of atomic models, molecular graphics and quality assessment. The software will ensure that high quality will accompany high throughput.
Within the SPINE project, most of the resources were devoted to major bottlenecks for structural biology, namely in protein cloning, overexpression and crystallization. Hence, SPINE had only limited resources to contribute to the development of high-throughput crystallographic computing, but by bringing together major users and providers of code it was in a good position to gain access and provide some input to developments. This problem is being addressed worldwide. It is clear that contacts and coordination are essential to optimize the output of developers and that such contacts must be maintained. Early in the programme SPINE held two workshops to discuss automation, attended by people both within the project and from associated groups. This report summarizes the activities at a third workshop where the current methodology was tested against targets selected from SPINE partner laboratories in Oxford and York. It does not describe in detail the software being developed or the structures of the individual targets, as these will be published elsewhere.
The SPINE project has tried to follow the traditional CCP4 (Collaborative Computational Project, Number 4, 1994) approach of linking contributions from a number of sources [such as ARP/wARP (Perrakis et al., 1999) and SHELX (Schneider & Sheldrick, 2002)] to form a set of modular tools. This requires agreement on exchange protocols, which can be hard to establish, but will result in more robust and flexible software which can be easily upgraded in the years to come. The aim of the workshop was to assess progress towards this within the SPINE team and its associates.
2. The target data sets
23 data sets were selected from a group of bacterial (mainly Bacillus anthracis and Campylobacter jejuni; Alzari et al., 2006) targets under study in Oxford and York (Table 1). Merged structure factors and the amino-acid sequence data were the basis for most of the activity; however, for a subset of targets raw images were made available for assessment of an automated processing protocol. In the event, this effort was essentially restricted to a single problem data set (OPPF1314).
|
The basic selection parameters were that target proteins should be less than 50 kDa, not part of a complex, contain no signal ARP/wARP packages of automated electron-density interpretation.
and have no transmembrane regions. Most were candidates for MR and were straightforward targets for the subsequent application of theDuring the workshop two structures were examined in greater detail to pinpoint problems in the structure-automation pipelines. These were OPPF1314 (Oxford) and SiaP (York).
2.1. OPPF1314
OPPF1314 data were used both for testing the data processing and analysis pipeline and for the automated model-building procedures. OPPF1314 is a 5-formyltetrahydrofolate cycloligase (BA4489) and has a molecular weight of 22.3 kDa (292 residues). The protein catalyses the ATP-dependent formation of 5,10-methenyltetrahydrofolate from 5-formyltetrahydrofolate (folinic acid; Huennekens et al., 1984).
The full details of the et al., 2006). Briefly, crystals were obtained from cocrystallizations of OPPF1314 with the substrates ATP and 5-formyltetrahydrofolate and data extending to 1.5 Å resolution were measured on ID14EH1 at the ESRF from a crystal belonging to P1 containing two molecules in the crystallographic Data were acquired in a high-resolution pass (in which many of the low-resolution reflections were overloaded) followed by a low-resolution pass. The diffraction showed a high degree of mosaicity. Data reduction with DENZO/SCALEPACK (Otwinowski & Minor, 1997) prior to the workshop gave an apparently reasonable merged data set, but it had proved impossible to phase this satisfactorily by MR using either a medium-resolution model structure obtained previously in a different or a related structure (PDB code 1ydm ) with 47% sequence identity.
will be described elsewhere (Meier2.2. SiaP
The SiaP protein, a candidate for ), was used during the workshop to test using experimental phasing to kick-start automated model building. One structure in the PDB, 1k7k , had some (25%) sequence identity, but only over a third of the molecule. MAD data sets had been collected for SeMet-labelled protein at three wavelengths (0.97907, 0.90778 and 0.97920 Å) on BM14, the UK MAD beamline at the ESRF. The SeMet crystals diffracted to 2.6 Å resolution with high merging R factors in the outer ranges and belonged to P21212, with two molecules in the crystallographic Although the data between 2.9 and 2.6 Å resolution were especially weak [I/σ(I) = 1.5 in the outer shell], they proved essential for structure solution.
(Table 116 Se atoms were expected in the SHELXC and SHELXD (Schneider & Sheldrick, 2002) found 14 sites. Phases had been calculated with SHELXE but automated construction of a model using REFMAC-ARP/wARP from these phases had failed: the procedure built many short disconnected with no side chains docked. The same heavy-atom solution was used with greater success in RESOLVE (Terwilliger, 2003), which built 468 residues in ∼44 chains, but with only 75 side chains docked. This model was in turn fed to REFMAC-ARP/wARP with the `default' options, but the procedure dismembered the model rather than adding additional features. The situation was improved by using the RESOLVE model with phase restraints imposed during REFMAC cycles. For this purpose, the reference phase set was based on the original SHELXE phases from the selenium improved by solvent flattening to give a single `best' phase estimate with an associated figure of merit. This gave a better result, with R converging at around 30.4% (Rfree was not used) for a model with 260 backbone residues in 27 chains and 65 side chains docked. This again reflected some degradation of the RESOLVE model. All this work was carried out prior to the workshop.
and3. Software suites and developers involved
Extensive use was made of CCP4 modules and utilities. The XIA-DPA system was used for the processing of X-ray images and their subsequent analysis and quality assessment. XIA-DPA provides wrappers for software packages including LABELIT (Sauter et al., 2004), MOSFLM (Leslie, 1999), XDS and XSCALE (Kabsch, 1993), SCALA (Evans, 1993), TRUNCATE (French & Wilson, 1978) and SFCHECK (Vaguine et al., 1999).
Sequence analysis and putative MR model identification used well established software packages available on the web, particularly MSDtarget and MSDfold from EBI (https://www.ebi.ac.uk/msd ) and BLAST (https://www.ncbi.nlm.nih.gov/BLAST/ ).
MR was carried out by three teams. All used established core software, but different scripted protocols. One group (RK and MW) has developed MrBUMP, which uses existing web servers (described below) to select multiple models, modifies them using CHAINSAW (Stein, unpublished work), MOLREP (Vagin & Teplyakov, 1997) or PDBCLIP (a local utility), followed by application of MOLREP or Phaser (McCoy et al., 2005) for the MR search. Models were assessed using REFMAC (Murshudov et al., 1997). A second group (NS and CB) has developed AutoAMoRe, incorporating CHAINSAW and AMoRe (Navaza, 1994). The third group (GNM, FL and AAV) has developed a package BALBES that utilizes a pre-constructed database for model selection, SFCHECK for data-quality analysis, MOLREP for model preparation and and REFMAC for initial and final quality assessment. Sequence analysis and model identification use programs and procedures under development by the authors (Murshudov, private communication).
With the exception of SiaP, there was little work on experimental phasing, as the automated pipelines are at an earlier stage of development. However, several of the Oxford structures requiring experimental phasing had been solved prior to the workshop using SHELXD and SHELXE (Schneider & Sheldrick, 2002).
Model construction and rebuilding for both MR and experimentally phased maps was primarily based on the REFMAC–ARP/wARP pipeline (SC, GL; Perrakis et al., 1999). Maps were visualized using Coot (Emsley & Cowtan, 2004) and development versions of Pirate (Cowtan, 2000) and Buccaneer (Cowtan, 2001) were tested. Ligand fitting was attempted using both ARP/wARP and Coot.
4. Data processing
4.1. Data integration
XIA-DPA was applied to targets where images were available. XIA-DPA is an automated wrapper for existing data-processing and analysis software. It aims to combine independently developed functionality in a modular manner so that it will be straightforward to replace individual functions. XIA-DPA incorporates these into an expert system capable of making decisions about how to process data without user intervention.
The user interface to XIA-DPA is simple: the filename of an image is sufficient to initiate data processing tasks for either two-dimensional or three-dimensional integration: xia-autoprocess-2d /path/to/data/set/foo_1_001.img or xia-autoprocess-3d /path/to/data/set/foo_1_001.img.
The current software distribution uses LABELIT to perform the autoindexing followed by two-dimensional integration with MOSFLM or three-dimensional integration using XDS. POINTLESS (Evans, 2006) is used to select the most likely Scaling and merging can be performed with SCALA and TRUNCATE or by XSCALE. Images are processed to provide reduced and scaled, merged and unmerged reflection files in the commonly useful formats (MTZ with I+, I−, I, F+, F−, F and SCALEPACK) along with estimates of the resolution and lists of possible space groups from an analysis of the The objective is to provide the initial stages of a `data-to-structure' pipeline to generate machine-readable information to be used in the subsequent steps of structure solution.
4.2. Data analysis and quality assessment
It was realised very early in the workshop that the experimental data as provided often did not carry all of the necessary crystal information in a form accessible by user or computer. Some of the information such as wavelength should be recorded in the reflection-file header. We suggest that a simple solution would be to define an accepted exchange format and record tagged information conforming to this format in an exchange file.
Decisions to be taken in automatic procedures fall into four categories.
Most of the necessary information is already available in the output from various programs, but is not yet encoded into an accepted exchange file. The data sets used for the workshop were evaluated retrospectively using TRUNCATE and SFCHECK (Table 2).
|
4.3. Test applications
4.3.1. OPPF1314
Analysis of the merged OPPF1314 data had shown an unusual distribution of reflection intensities, in particular in the low-resolution range (Fig. 1). The original images were reprocessed during the workshop with XIA-DPA, trying both the two-dimensional and three-dimensional options. The two-dimensional pipeline (i.e. using MOSFLM and SCALA) produced similar results to those from DENZO/SCALEPACK; again the intensity statistics were unusual. However, the three-dimensional pipeline (XDS and XSCALE) gave a well merged data set which led to a successful structure solution in a straightforward manner. More detailed analysis and careful reprocessing with MOSFLM/SCALA subsequent to the workshop suggested that the first failures were a consequence of poor relative scaling of the low- and high-resolution passes and that the high mosaic spread proved difficult to handle with the current two-dimensional software. The scaling R factor between the XDS and reprocessed MOSFLM amplitudes is ∼4% to 2.3 Å, rising to 16% at 1.5 Å. The quality assessment (Tables 2 and 3) suggested that the two-dimensional integration with MOSFLM was unsatisfactory in the higher resolution ranges. SFCHECK showed that the high mosaicity reduced the completeness of the two-dimensional data set in certain zones.
|
4.3.2. Other targets
Diffraction images for a further six targets were reprocessed at the workshop using XIA-DPA using the two-dimensional option (Table 4). The agreement with the data provided for the workshop for BA0592 appears to be satisfactory and the quality assessment for the other data sets was acceptable (Table 2). BA0296 illustrates a situation where rapid automated data processing should have been performed during data collection to select the best strategy. Autoindexing was satisfied by a cubic However, subsequent analysis showed that the was tetragonal. This unfortunately resulted in a rather incomplete data set. This workshop greatly stimulated the further development of the XIA-DPA pipeline.
|
4.4. Discussion; lessons for automated data processing
The data-processing and analysis step is critical for the structure-solution pipeline. The XIA-DPA pipeline performs adequately for many data sets, but it is essential that it flags aberrant cases and alerts the crystallographer in the more challenging cases, such as OPPF1314. In the case of BA0296 the information derived from initial indexing was later found to be incorrect and this highlights the need for easy and reliable determination of the from limited data during data collection.
A final assessment of data quality at the end of an automated procedure is good practice and should always be carried out. Automated procedures have two features to offer, reproducibility and standardization, which should allow for more objective summary statistics. They may also perform the tedious transformations between different packages, enabling all statistics to be calculated by the same program and hence be more comparable.
Finally, in any automation effort, the success of the pipeline depends on the cumulative success of the individual steps. As data-processing programs become more reliable and sophisticated, the overall success rate of the pipeline should improve.
5. Molecular-replacement pipelines
Of all the automated structure-solution pipelines, those addressing MR are the most advanced at present and were the core activity at the workshop. Each pipeline has five basic tasks to address.
5.1. Individual pipelines
In this section, we report on the results obtained by the three teams. All teams used some of the same tools for the various tasks. CHAINSAW (next release of the CCP4 suite) is a new utility developed for manipulating models. It examines the sequence alignment between target and template provided in a standard format and modifies the template PDB file by pruning non-conserved side chains back to the γ atom while leaving conserved residues unchanged. Atom and residue names and numbers are matched to those in the target. The result is what Schwarzenbacher et al. (2004) have termed a `mixed model', since more atoms are preserved than in a polyalanine model, but parts of the model which are unlikely to be present in the and thus may degrade the signal, are pruned.
MOLREP also contains many model-preparation tools. It aligns a given sequence with the model and prepares a truncated model. It can also carry out locked translation searches using a given transformation. AMoRe is a well established package which separates the rotation, translation and rigid-body modules, allowing great flexibility in tailoring protocol to problem. Phaser has a sophisticated scoring scheme and also recalculates the orientation search for multiple molecules, taking account of the contribution from any model positioned previously. REFMAC5 was used to assess the quality of the solutions; if both the R and Rfree fell, then a solution was judged to be substantially correct.
5.1.1. The MrBUMP package
The MrBUMP package has been developed as part of the eHTPX (https://www.e-htpx.ac.uk/ ) and CCP4 projects: eHTPX provided extensive computing resources in the form of an 18-CPU cluster accessed via an eHTPX web service. This mode is particularly valuable for marginal cases with low sequence homology where a parallel approach can be used with a range of putative trial models and methodologies being investigated contemporaneously. Alternatively, the MrBUMP package can be customized to run on a desktop and it was tested in both modes during the workshop. It is designed to make use of web-accessible databases of sequences and structures, rather than relying on local databases. This guarantees that the information is up to date, but has the drawback that queries are submitted across a public network and may be slower. Since the workshop, MrBUMP has been extended to allow this search to be performed locally.
In brief, the MrBUMP pipeline comprises the following steps: the properties of the target are generated from the reflection and sequence files provided (for example, the expected number of molecules in the asymmetric unit) and then a FASTA search of the current PDB is made to generate a list of possible homologous structures, which are then downloaded. For each, the PQS server (https://pqs.ebi.ac.uk/ ) is queried to ascertain whether the model exists as a multimer. If so, and if the multimer could fit into the target it is added to the list of templates. A recent addition, not available for the tests described here, is to add domains identified by SCOP (Murzin et al., 1995) to the list of templates.
The next step is to generate trial models from the templates. Currently, three methods are used. Firstly, in the PDBCLIP method the raw template coordinates are used, after various tidying steps such as removal of waters, removal of alternative conformations etc. Secondly, the alignment and model-improvement method of MOLREP is used. Thirdly, the CCP4 program CHAINSAW described above is used to provide a mixed model.
The top models are passed to MOLREP for the first attempt at MR. If MOLREP produces a solution (irrespective of score), the positioned model is passed to REFMAC for 30 cycles of The free R factor is used as the criterion for success. If the Rfree drops significantly, the script stops and reports details of the solution. Marginal solutions are identified without stopping the script. Unless a clear solution is obtained, the MrBUMP script continues to process all trial models through MOLREP, after which the process is repeated using Phaser as the MR engine. For the cluster implementation of MrBUMP, trial models are processed in parallel, with a success on any cluster node stopping the script on all nodes.
A summary of MrBUMP results is given in Table 5. Unless explicitly indicated otherwise, these results were obtained during the workshop, without previous knowledge of the structures and without any ad hoc customization of the default script. The only exception to this is that if the actual target structure had been deposited, the script was run with this structure excluded. It became clear that the criterion for a good solution was too strict and in many cases the script continued processing trial models after a good solution had been found. In these cases, Table 5 shows one or more of the solutions identified as `marginal' rather than as `success'. For comparison, in some cases, the solution from the Phaser loop is shown alongside that from the MOLREP loop. The result of the MR step is a positioned but inaccurate and incomplete model. The cycles of often indicate that the model will refine and that a final model is likely to be realised (see, for example, BA4499). In other cases, there is no such clear-cut indication (for example, BA1563) and conclusions on the success of the procedure must await model rebuilding.
|
The second column in Table 5 shows the actual number of molecules in the In three cases, the automated script overestimated the correct number: BA1483, BA3935_2 and BA0592. For the last two this did not matter, since MOLREP failed to find the predicted last molecule and the proceeded with the correct number of molecules. In the first case, MOLREP found seven molecules, so that the final model has a spurious extra molecule. At present, the MrBUMP pipeline does not deal explicitly with translational NCS, as occurs for example in OPPF651. In the majority of cases, there were several models and methods that could be used to solve the structure and the choice selected by the script (after identifying a `success') or the author (after examining `marginal' solutions) is largely arbitrary. Often, the structure could be solved both with the monomer search model and with a multimer. For example, BA0288 could be solved with chain A of 1ul1 or with the octamer downloaded from the PQS server. Both refine quickly to adequate R values, though the constrained geometry of the octamer leads to a slightly poorer result. The advantage of the multimeric search is speed and potentially signal-to-noise ratio and in general it would be tried first.
Of the 17 structures attempted (ignoring duplicate entries in Table 5), ten were essentially solved, six were possibly solved but require further investigation and one was clearly unsolved. It should be stressed that these conclusions are based on statistics from the MR and programs and in some cases from rebuilding in ARP/wARP and are provisional pending structure completion. In the case of CJ0982, which is deemed unsolved, the initial homology search yields only one hit which has a sequence identity of 44% but with an alignment length of only 70 residues. As discussed elsewhere, there were problems with the data processing of OPPF1314 and the result quoted in Table 5 is against the original problematic data truncated to 2.3 Å resolution.
The MrBUMP package is still under development. The current version is expected to automate structure solution via MR in straightforward examples. For the current test cases, it identified solutions or likely solutions in the majority of cases. Many of these cases had good homologues available in the PDB and could be solved by any reasonable method. For these, the advantage of MrBUMP is simply one of convenience, in particular when several homologues need to be tried and compared.
MrBUMP requires the CCP4 package plus a small number of helper applications and it was installed at York without problems. It is currently run from a simple shell script and users at the workshop found it easy to run the package for themselves. MrBUMP provides a framework within which further developments can be made in order to tackle more difficult cases. Ongoing work addresses both the algorithms applied in each step and the connecting work flow. Since the workshop, MrBUMP has been made available at https://www.ccp4.ac.uk/martyn/BMP/mrbump.php and feedback is encouraged.
5.1.2. Automated with AutoAMoRe
The advantages of AMoRe are its speed and flexibility. AutoAMoRe is a Python script produced as part of the CCP4 automation project to automate the numerous steps in solving a structure by MR using AMoRe. The AutoAMoRe script calls various CCP4 utility programs. It checks the native Patterson for translational NCS and, if appropriate, positions molecules on a pairwise basis. The final coordinates are generated using PDBSET and checked for clashes with DISTANG. AutoAMoRe generates a concise summary file of important parameters. The methodology adopted was to feed the target sequence into the BLAST server and choose as the template the solved structure with highest homology, after excluding any with 100% identity. The coordinates were downloaded from the EBI (https://www.ebi.ac.uk/ ) and passed through CHAINSAW. This selection and manipulation was performed manually and only the subsequent MR calculations with AMoRe were automated. However, the AutoAMoRe script has since been incorporated as a module in the MrBUMP pipeline. AutoAMoRe was run on 18 of the target data sets using a monomer model in each case. Solutions were scored by inspecting the final for all molecules in the and any solution with greater than 20 clashes was rejected. The highest scoring solution was subjected to ten cycles of using REFMAC with the default parameters from the CCP4i GUI. If the Rfree fell by more than 5% during the solution was deemed to be successful. Six structures were solved successfully using the version of the AutoAMoRe software available at the workshop. A number of subsequent improvements to the software resulted in solutions to another four cases, an overall success rate of 55%. The results are summarized in Table 6.
|
5.1.3. BALBES
BALBES is a system for automatic MR developed by FL, AAV and GNM. It has three main components: the PDB (Bernstein et al., 1977; Berman et al., 2000), which has been reorganized to aid model selection, a Python script which controls the work flow and makes decisions and scientific programs to perform the actual calculations. The script uses the following programs: SFCHECK for structure-factor analysis, MOLREP for and REFMAC for Several other programs for purposes such as alignment and searching in the reorganized PDB have been developed.
The ∼30 000 structures in the PDB have been reorganized and classified according to sequence and three-dimensional structure. Redundant entries are removed if two proteins have a sequence identity above 90% or if the root-mean-square deviation between matched atom pairs after superposition is less than 1 Å. This reduces the number of entries to a reference set of ∼10 000 structures, which are organized into a hierarchical database based on similarity. For each entry, potential multimers and domain structures are established and catalogued.
The Python script reads the experimental data and sequence information for the protein under investigation. The reorganized PDB is searched for related sequences, candidate models are identified and coordinates returned with multimers and domains where appropriate. The whole search takes around 10 s on a Macintosh G5 computer. The putative models are modified according to sequence identity and surface accessibility. The experimental data are analysed using SFCHECK, which indicates problem features such as pseudo-translation, or anisotropy and suggests the best resolution for the MR search. Information from these analyses is passed to MOLREP. Several protocols are tested in order: first with multimers, then with individual subunits and then with domains. After each protocol, a decision is made as to whether the `solution' is correct. If the number of expected monomers is not yet satisfied, then MR is continued. During the workshop models were passed directly to ARP/wARP for rebuilding. Subsequently, a better protocol has been implemented: the MR solution is first passed to REFMAC for a number of cycles of rigid-body followed by This has led to substantially better results for the rebuilding.
BALBES is at the development stage and the database is updated automatically and regularly. For the current tests the database includes PDB entries released by the end of 2004.
Only a subset of the available protocols was necessary for the workshop examples. These included simple MR with one subunit (two cases, one successful), a search with dimers (four cases, two successful, one probably successful), a stepwise search for multiple subunits (13 cases, ten successful and one probable) and a use of pseudo-translation (four cases, two successful, one probable). More sophisticated protocols such as domain searches, multi-copy searches, iterative .
and MR were not required for the current tests. Experience at the workshop suggested that implementation of several protocols needed to be faster. The results are summarized in Table 7
|
5.2. Summary of molecular-replacement pipelines
Out of the structures considered, the majority have a close homologue available and are straightforward to solve by MR. The minority that are not straightforward to solve are the interesting examples for methods development and will be the focus of further work. The difficulty may arise from problems in the data processing. In other cases, more sophisticated model generation may be needed or experimental phasing is required.
The solution of the OPPF1314 structure was one of the key achievements of the workshop. All three pipelines were able to determine the solution with either the low-resolution data integrated using DENZO/SCALEPACK or that processed using the three-dimensional XDS option of XIA-DPA. Refining and building a complete model proved more challenging and was only possible after reprocessing the data.
Comparison of the results with MrBUMP and BALBES at the workshop showed that even for identical models ARP/wARP performed better with the MrBUMP solutions. A key difference between the two approaches was the absence of an automated step for BALBES prior to attempting to rebuild the model with ARP/wARP. The introduction of into the BALBES protocol subsequent to the workshop now gives highly comparable results for the two pipelines (Tables 5 and 7).
At the simplest level, the role of automation is one of convenience, providing a solution with little user effort. However, it is likely that automation can provide more objective criteria of relative success for different models and also optimize the solution to minimize the effort required later for model rebuilding and completion.
6. Experimental phasing
Experimental phasing was not addressed in any depth as part of the automation testing, as the pipelines are at an earlier stage of development. Again, it became clear that many key items of information were not directly available, e.g. several reflection files did not provide correct wavelength information or document MAD data sets. OPPF1294, OPPF1311, OPPF2088 and OPPF2153 had all been solved previously using the SHELX suite. However, the SiaP structure was phased and largely built during the workshop, providing insight into how best to use weak phase information for automated model building at moderate resolution.
The principal software vehicle for this investigation was Pirate, a statistical phase-improvement program (Cowtan, 2000). Pirate classifies the electron density by sparseness/denseness and order/disorder without requiring knowledge of the solvent content. Statistical targets are constructed, from which the distribution of probable density values is inferred on the basis of local density mean and variance. These targets are optimized to the problem at hand by the use of a known `reference' structure which is manipulated by a process of scaling and error simulation to produce a map which is statistically similar to the map under examination. The software, which is still under development, is designed to be used in a fully automated manner.
The SiaP structure was phased using only the peak data from the Se sites found using SHELXD and initial phasing performed by SHELXE. This uses a solvent-flattening procedure to refine the initial SAD estimates and outputs phases and associated figures of merit. During the workshop, phases were recalculated using MLPHARE to record Hendrickson–Lattman (HL) coefficients. The average figure of merit was 0.43, falling to 0.15 at 2.7 Å. Pirate was used to improve these SAD phases, reducing the overall phase error from 64 to 48° (evaluated against the final model fully refined after the workshop) and providing better and much more realistic estimates of the figure of merit. Model completion was attempted from both these starting phase sets (§7.4).
7. Constructing and completing models
There was not sufficient time at the workshop to explore fully the best approach to the problem of model preparation. In cases where the resolution of the data set extends to ∼2.3 Å, packages such as ARP/wARP can often build a model automatically, providing that there are sufficiently good quality starting phases. ARP/wARP was run on all suitable MR solutions and the results are given in Table 8; this exercise was subsequently repeated using pyWARP (Cohen et al., 2004; Table 9). However, model building is a real stumbling block for lower resolution data sets and models with low sequence identity (typically less than 30%). Full automation of the MR pipeline for low-resolution data may require the incorporation of new modules such as Buccaneer (Cowtan, 1998, 2001), which is designed to recognize larger structural features. In all cases it is necessary to complete the model using a graphical display and Coot (Emsley & Cowtan, 2004) can provide this functionality. These modules are now described in more detail.
|
7.1. ARP/wARP
ARP/wARP was used to evaluate the quality of the solutions obtained from the different MR pipelines. Each solution was input to the currently distributed version of ARP/wARP (v.6.1.1) and ten rebuilding cycles were performed (each comprising five update cycles) starting from the positioned model (using the mode described in Perrakis et al., 1999, 2001). In this procedure, most of the stereochemical information is preserved as long as possible during the building, giving the program REFMAC more restraints.
Some of the data sets highlighted an error in the sequence-docking/side-chain building module of ARP/wARP, which occurred when a main-chain fragment became longer than the provided sequences. This was fixed following the meeting and the corrected version used to generate Tables 8 and 9. All solutions now run through to the preset end, with the exception of the BALBES solution for OPPF1311, which still causes problems.
The available MR solutions were used to evaluate pyWARP, a new control system for ARP/wARP currently under development (Cohen et al., 2004). This control system makes run-time decisions based on the current status of the model. Tables 8 and 9 also show the results from pyWARP, which appears to perform a little better than ARP/wARP and clearly docks a greater portion of the traced main chain into the sequence. Indeed, in difficult cases (OPPF2153 and OPPF2245) pyWARP significantly improved the completeness of the autotraced model, showing the value of using variable parameterization during the procedure.
7.2. Buccaneer
Buccaneer is a new model-building program which makes repeated application of a single optimized feature-recognition technique. The process involves the construction of an optimal likelihood density target for the electron density in a 4 Å sphere around a typical Cα atom (the idea, but not the target function, is similar to that in Ioerger & Sacchettini, 2002). The `optimization' of the target is similar to that described in §6. A six-dimensional search is made in the unsolved map for likely Cα atom positions using Fast Fourier Feature Recognition (Cowtan, 2001). Once initial candidate positions are obtained, a `growing' procedure is applied to find chain fragments. New residues are added in each direction using the Ramachandran plot to constrain chain geometry and a two-residue-deep search to rank the fit to density of the new positions. It is computed using the same likelihood density target, but now calculated in real space. The procedure continues until the fit to density falls below some threshold. The next stage is to combine and merge overlapping chain fragments, using the Coot utility GLOBULARISE-PROTEIN. The approach is implemented using the CLIPPER libraries (Cowtan, 2003). Despite its simplicity, it quickly rebuilt missing features for two of the lower resolution MR structures, BA1071 and BA4508. Its performance with the SiaP structure is described below. It can be integrated into a recycling scheme including density modification and Other improvements are possible, such as using bi-residue groups in common conformations.
7.3. Coot
Coot is a molecular-graphics application for protein map interpretation and structure validation. The workshop highlighted a number of its strengths, but also some missing features. It proved very powerful for rebuilding initial models from automated model building. The validation tools identify poorly built regions of the model, both by a variety of geometrical indicators and by fit-to-density analysis. These regions can be rapidly improved by interactive real-space and regularization. Although Coot was not used greatly in its role as a model-completion and validation tool by this workshop, missing features were highlighted including a means to reverse a baton-built Cα trace, better tools for joining fragments, a user interface for the automatic restoration of side chains and tools to correct out-of-register errors (a more substantial problem). Some of these deficiencies have now been addressed. Coot presently implements all these algorithms from a graphical interface, but it should be possible to incorporate the underlying functionality into a command-line-driven program suitable for an automated pipeline.
7.4. Experimental phasing case history: SiaP
Several attempts were made to rebuild this structure using ARP/wARP. In §2.2 we describe the results obtained before the workshop. The breakthrough came from the procedures described in §6, namely when the SAD experimental phase distributions in the form of the Hendrickson–Lattman coefficients calculated using the MLPHARE program were provided as restraints to REFMAC-ARP/wARP. Firstly, the procedure was started from the RESOLVE partial model and the MLPHARE phases and within 25 cycles it had built 560 of the expected 612 residues, with 545 side chains docked. Secondly, the more straightforward approach of feeding the Hendrickson–Lattman coefficients directly into the ARP/wARP procedure used for building the initial model was attempted. This took longer, but effectively reached the same solution. The third and fourth tests used Buccaneer to construct an initial model. The third test began from the MLPHARE phases, from which Buccaneer built a polyalanine model of 288 residues (47% of the total). The fourth test used the improved phases from Pirate and with these Buccaneer was able to construct a 384-residue polyalanine model (63%). Both these models were able to kick-start the ARP/wARP procedure and speeded up its convergence considerably. Starting from the Pirate/Buccaneer model, ARP/wARP completely built 578 residues (94%).
With the current state of developments this is an impressive result: the automated building and
of an essentially complete protein structure with rather weak 2.6 Å SAD data. The application of any specific program in this solution is certainly less important than the retention of the full experimental phase distribution as restraints in the model-building/refinement stage.This result has influenced several developments within the CCP4 and York automation pipelines currently under construction.
|
7.5. Molecular-replacement case history: OPPF1314
Before the workshop, the DENZO/SCALEPACK data set from the high-resolution data-collection pass alone had given a clear MR solution with the expected two molecules in the Structure completion with ARP/wARP was partly successful, but convergence was slow: missing data at low-resolution inevitably degrade the electron density which ARP/wARP requires for selection and rejection of atomic sites.
After initial reprocessing with the XIA-DPA three-dimensional option, the data from the combined high- and low-resolution passes showed an expected distribution in reflection intensities (§4.3; Fig. 1). This allowed ARP/wARP to produce a structure containing 329 residues in nine chains out of the 400 residues expected in the (for the final cycle of rebuilding the R factor was 0.234 with an Rfree of 0.289). The maps showed that one part of each molecule was poorly ordered, explaining the missing residues. Further inspection of the maps revealed additional features in the electron density that were not part of the protein and could be attributed to bound ligands (§7.6).
The two-dimensional integrated data also led to a similarly successful result (Table 9) after reprocessing using the improved XIA-DPA software (§4.3). However, with data from both the three-dimensional and two-dimensional integration ARP/wARP built more residues when the data were restricted to 1.85 Å resolution rather than using the full range to 1.5 Å. This may reflect the optimization of ARP/wARP, the poorer quality of the outer shells or be the result of residual errors arising from the effect of the mosaicity on the highest resolution data. In all cases, the extra functionality of pyWARP proved its worth.
Since the workshop, R factor is 0.219 (with an Rfree of 0.265).
of this structure has been completed, giving a model containing 192 residues from each chain, one ADP bound to each chain and 256 modelled waters. The currentWhat conclusions can be drawn from this case study?
|
7.6. Ligand binding to OPPF1314
7.6.1. ARP/wARP LigandBuild
Once a model is close to completion, the remaining density can be searched for small-molecule ligands. This procedure can be accomplished using the ARP/wARP suite (v.6.1.1) with CCP4 and a text editor. The ARP/wARP LigandBuild graphical user interface requires structure-factor amplitudes, protein coordinates without any HETATM entries (used to generate a mask) and a set of coordinates for the known ligand (Zwart et al., 2004).
For OPPF1314, the extra density after model building (§7.5) was assumed to be attributable to one or both of the cofactors in the crystallization screen, ATP and 5-formyltetrahydrofolate, which have a somewhat similar shape. Automated ligand fitting was attempted for both cofactors. The first and second trials failed; the map was still too noisy and wrong sites with impossible conformations were found. In each case, the volume covered by these was then added to the mask and the procedure was repeated. The correct sites were found in the third and fourth attempts and were verified by inspection of the electron density. The fit with ATP was clearly superior and after inspection of the electron density it was concluded that each of the two molecules in the bound a well ordered ADP (Fig. 2). Although there were other residual density features, they could not be unambiguously attributed to 5-formyltetrahydrofolate.
The result showed the need for updating masks. The PDB file was modified automatically to add extra `atoms' at each cycle. This is also needed when searching for multiple ligands, where it is recommended to build the largest first. It also showed how the recognition capabilities of the software are limited by noise in the map, as the search currently only checks a limited number of features for a potential ligand site.
7.6.2. Ligand building with Coot
Coot has the ability to search a map for likely ligand sites. It uses the REFMAC monomer dictionary to provide a description of the ligand geometry and also needs a set of coordinates for the known ligand, which can be provided by the CCP4 program LIBCHECK (Vagin et al., 1998). As with ARP/wARP LigandBuild, the density is masked by selected coordinate sets. After ARP/wARP for OPPF1314, Coot found seven putative ligand sites matching the expected size and shape for ADP. On visual inspection, several were found to be protein structure missing from the model, but the first and second sites ordered on the density correlation corresponded to the two nucleotide sites. The fit to density was then optimized using Coot's real-space option.
7.7. Summary of model rebuilding
From Tables 8 and 9, it is noticeable that the success of ARP/wARP varies substantially depending on how the positioned model was obtained. For target BA1563, for example, ARP/wARP was able to rebuild about half of the model using the output of MrBUMP, while very little could be rebuilt from the other putative solutions. Both MrBUMP and BALBES used 1ufv as the template and MOLREP for MR. At the workshop, for target BA3935_2 ARP/wARP rebuilt about 80% of the model from MrBUMP, but only about 50% of the model from BALBES. Post mortem analysis showed that the difference was that MrBUMP carried out 30 cycles of REFMAC before ARP/wARP. This step has subsequently been activated in BALBES.
We have not carried out a systematic investigation of the factors which are important for successful rebuilding and the differences noted may be coincidental. However, where there is no clear-cut MR result, there is a clear advantage in taking several putative solutions through to model rebuilding. In the BA3935_2 example, both templates 1dhp and 1s5t have 42% sequence identity with the target and both should be tried. The high-resolution limit of the data for BA1563 and BA3935_2 is in both cases 2.2 Å and it is in this regime where subtle differences may affect the ARP/wARP procedure. Automated schemes are particularly useful for investigations of such multiple models.
8. Conclusions
Protein crystallography has a series of potential bottlenecks, including protein overexpression, solubility, crystallization and structure solution. Recently, rapid advances in the first three of these have been made (see other contributions to this issue). Considerable progress has been and is being made worldwide in the automation of structure analysis. The automation of image processing and data reduction was addressed at the workshop using XIA-DPA. The results obtained emphasized the importance of this step and showed that while it was in principle subject to a high level of automation, great care needs to be taken in establishing protocols and in passing the appropriate information to subsequent steps.
The MR step in the structure-solution pipeline is presently closest to full automation. Three emerging procedures were extensively tested with a high success rate and this software should be released for general use within the next year. Lessons learnt at the workshop included (i) the apparent advantage of running several cycles of Rmerge ≃ 6%, low-resolution shell ≃ 4%, high-resolution shell ≃ 35%, completeness ≃ 90%). At present, poses real problems, but this should be resolved in the near future: for crystals with diffraction to ∼2.1 Å or better may be necessary.
on correctly positioned MR models before starting the rebuilding procedure and (ii) generally trying a multimer as a model when appropriate before trying individual subunits. Based on the results obtained, automated MR procedures are likely to be successful, at least for crystals which diffract to 2.5 Å or better and satisfy a number of defined criteria (overallModels for MR should satisfy one of the following criteria.
|
For structures solved by experimental phasing, there are already modules such as the SHELX suite, AUTOSHARP and SOLVE/RESOLVE which integrate parts of the pipelines. These were tested on a couple of examples at the workshop and the importance of making full use of the Hendrickson–Lattman coefficients for low-resolution data became clear. Pirate was tested for density modification and appears to provide a more realistic set of Hendrickson–Lattman coefficients than earlier software. Pipelines being developed are still at the early stages, but considerable insight was gained as to the direction which these developments should take. The restrictions on data quality are quite different from those stated above for MR; experimental phasing is effective at substantially lower resolutions but requires much more accurate estimates of intensity, usually achieved by measuring high-multiplicity data sets.
ARP/wARP was the only automated model-building tool used extensively. It proved to be very powerful for structures with data extending to 2.3 Å or better (see Table 8). At lower resolutions problems were encountered and the feature-recognition tools within the Buccaneer and Coot programs, briefly tested during the workshop, would need to be exploited for these problems. However, crystals with diffraction limits in the greyzone (2.7–3.3 Å) still require a lot of time and effort and sometimes this effort fails. High-throughput means a limited amount of time can be dedicated to an individual project, resulting in a need for automation. We have encountered several examples, BA0592 (from the workshop set) and BA4525 (collected recently), where crystals have been obtained, data collected and a MR solution found, but the project has had to be abandoned because automated model building and failed. The more sustained effort of project-oriented research may have led to success.
Taken altogether, the outlook is very promising for modules for fully automated solution of protein crystal structures in the near future, provided the data are of sufficient quality and resolution.
Acknowledgements
The SPINE project is funded by the European Commission as SPINE (Structural Proteomics In Europe) contract No. QLG2-CT-2002-00988 under the Integrated Programme `Quality of Life and Management of Living Resources'. GW and RK are supported by the BBSRC e-HTPX grant (BEP17782) and CB, NS, MGWT and MW by CCP4. KDC is supported by The Royal Society (grant No. 003R05674). PE and AAV are funded by BBSRC grant No. 87/B17320. GNM is supported by the Wellcome Trust, FL by the EU BIOXHIT contract under the Sixth Framework Programme thematic area `Life Sciences, Genomics and Biotechnology for Health' contract No. LHSG-CT-2003-503420. ARP/wARP algorithm development at the NKI (AP, SXC) and the EMBL (VL, GL) is funded by the NIH (grant R01 GM62612-01) and the EU BIOXHIT contract. AP and SXC thank Marouane Ben Jelloul for his work in the development of pyWARP.
References
Alzari, P. M. et al. (2006). Acta Cryst. D62, 1103–1113. Web of Science CrossRef CAS IUCr Journals Google Scholar
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F. Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535–542. CrossRef CAS PubMed Web of Science Google Scholar
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235–242. Web of Science CrossRef PubMed CAS Google Scholar
Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1998). Proc. Natl Acad. Sci. USA, 95, 6073–6078. Web of Science CrossRef CAS PubMed Google Scholar
Cohen, S. X., Morris, R. J., Fernandez, F. J., Ben Jelloul, M., Kakaris, M., Parthasarathy, V., Lamzin, V. S., Kleywegt, G. J. & Perrakis, A. (2004). Acta Cryst. D60, 2222–2229. Web of Science CrossRef CAS IUCr Journals Google Scholar
Collaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760–763. CrossRef IUCr Journals Google Scholar
Cowtan, K. (1998). Acta Cryst. D54, 750–756. Web of Science CrossRef CAS IUCr Journals Google Scholar
Cowtan, K. (2000). Acta Cryst. D56, 1612–1621. Web of Science CrossRef CAS IUCr Journals Google Scholar
Cowtan, K. (2001). Acta Cryst. D57, 1435–1444. Web of Science CrossRef CAS IUCr Journals Google Scholar
Cowtan, K. (2003). IUCr Comput. Commun. Newsl. 2, 4–9. Google Scholar
Emsley, P. & Cowtan, K. (2004). Acta Cryst. D60, 2126–2132. Web of Science CrossRef CAS IUCr Journals Google Scholar
Esnouf, R. M. (1999). Acta Cryst. D55, 938–940. Web of Science CrossRef CAS IUCr Journals Google Scholar
Evans, P. (1993). Proceedings of the CCP4 Study Weekend. Data Collection and Processing, edited by L. Sawyer, N. Isaacs & S. Bailey, pp. 114–122. Warrington: Daresbury Laboratory. Google Scholar
Evans, P. (2006). Acta Cryst. D62, 72–82. Web of Science CrossRef CAS IUCr Journals Google Scholar
French, S. & Wilson, K. (1978). Acta Cryst. A34, 517–525. CrossRef CAS IUCr Journals Web of Science Google Scholar
Huennekens, F. M., Henderson, G. B., Vitols, K. S. S. & Grimsha, C. E. (1984). Adv. Enzyme Regul. 22, 3–13. CrossRef CAS PubMed Web of Science Google Scholar
Ioerger, T. R. & Sacchettini, J. C. (2002). Acta Cryst. D58, 2043–2054. Web of Science CrossRef CAS IUCr Journals Google Scholar
Kabsch, W. (1993). J. Appl. Cryst. 26, 795–800. CrossRef CAS Web of Science IUCr Journals Google Scholar
Leslie, A. (1999). Acta Cryst. D55, 1696–1702. Web of Science CrossRef CAS IUCr Journals Google Scholar
McCoy, A. J., Grosse-Kunstleve, R. W., Storoni, L. C. & Read, R. J. (2005). Acta Cryst. D61, 458–464. Web of Science CrossRef CAS IUCr Journals Google Scholar
Meier, C., Carter, L. G., Esnouf, R. M., Owens, R. J. & Stuart, D. I. (2006). In preparation. Google Scholar
Merritt, E. A. & Murphy, M. E. P. (1994). Acta Cryst. D50, 869–873. CrossRef CAS Web of Science IUCr Journals Google Scholar
Murshudov, G. N., Vagin, A. A. & Dodson, E. J. (1997). Acta Cryst. D53, 240–255. CrossRef CAS Web of Science IUCr Journals Google Scholar
Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). J. Mol. Biol. 247, 536–540. CrossRef CAS PubMed Web of Science Google Scholar
Navaza, J. (1994). Acta Cryst. A50, 157–163. CrossRef CAS Web of Science IUCr Journals Google Scholar
Otwinowski, Z. & Minor, W. (1997). Methods Enzymol. 276, 307–326. CrossRef CAS Web of Science Google Scholar
Perrakis, A., Harkiolaki, M., Wilson, K. S. & Lamzin, S. (2001). Acta Cryst. D57, 1445–1450. Web of Science CrossRef CAS IUCr Journals Google Scholar
Perrakis, A., Morris, R. & Lamzin, V. S. (1999). Nature Struct. Biol. 6, 458–463. Web of Science CrossRef PubMed CAS Google Scholar
Sauter, N., Grosse-Kunstleve, R. & Adams, P. (2004). J. Appl. Cryst. 37, 399–409. Web of Science CrossRef CAS IUCr Journals Google Scholar
Schneider, T. R. & Sheldrick, G. M. (2002). Acta Cryst. D58, 1772–1779. Web of Science CrossRef CAS IUCr Journals Google Scholar
Schwarzenbacher, R., Godzik, A., Grzechnik, S. K. & Jaroszewski, L. (2004). Acta Cryst. D60, 1229–1236. Web of Science CrossRef CAS IUCr Journals Google Scholar
Terwilliger, T. C. (2003). Methods Enzymol. 374, 22–37. Web of Science CrossRef PubMed CAS Google Scholar
Vagin, A. A., Murshudov, G. N. & Strokopytov, B. V. (1998). J. Appl. Cryst. 31, 98–102. Web of Science CrossRef CAS IUCr Journals Google Scholar
Vagin, A. A. & Teplyakov, A. (1997). J. Appl. Cryst. 30, 1022–1025. Web of Science CrossRef CAS IUCr Journals Google Scholar
Vaguine, A. A., Richelle, J. & Wodak, S. (1999). Acta Cryst. D55, 191–205. Web of Science CrossRef CAS IUCr Journals Google Scholar
Zwart, P. H., Langer, G. G. & Lamzin, V. S. (2004). Acta Cryst. D60, 2230–2239. Web of Science CrossRef CAS IUCr Journals Google Scholar
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited. For more information, click here.