Received 14 June 2004
Automated and accurate deposition of structures solved by X-ray diffraction to the Protein Data Bank
Huanwang Yang, Vladimir Guranovic, Shuchismita Dutta, Zukang Feng, Helen M. Berman and John D. Westbrook
The RCSB Protein Data Bank (PDB) has a number of options for deposition of structural data and has developed software tools to facilitate the process. In addition to ADIT and the PDB Validation Suite, a new software application, pdb_extract, has been designed to promote automatic data deposition of structures solved by X-ray diffraction. The pdb_extract software can extract information about data reduction, phasing, molecular replacement, density modification and refinement from the output files produced by many X-ray crystallographic applications. The options, procedures and tools for accurate and automated PDB data deposition are described here.
The number of structure determinations of biological molecules has increased dramatically during the last several years as a result of improved methods for protein production, crystallization, data collection, phase determination and refinement. An important focus of many current international initiatives in structural genomics is the creation of automated high-throughput pipelines for structure determination and analysis (Burley et al., 1999). Thus, the number of structures deposited to the Protein Data Bank (Berman et al., 2000, 2003; Bernstein et al., 1977) will continue to grow rapidly.
In addition to the increased number of structures, information about the experiments that produced these structures is also increasing, although at a more modest rate (Fig. 1). Whereas a minority of depositions used to include structure factors, in 2003 almost 80% of crystal structure depositions included these data. The PDB Exchange Dictionary, which is an expanded form of the macromolecular Crystallographic Information File (mmCIF; Bourne et al., 1997), now includes more than 4000 potential data items (http://deposit.pdb.org/mmcif ). Even though all of these data items might not be appropriate for any one structure, only a small fraction of what is appropriate is currently represented in a typical PDB file.
Figure 1. The number of unique data items deposited to the PDB by year. The turquoise, blue and black bars represent the minimum, maximum and average number of data items for each structure, respectively.
For the continued improvement of structure-determination methods and for data-mining applications it is important that the protein-structure information deposited into the PDB is as complete and accurate as possible. Information about data collection, phasing and refinement is fully recorded in the output files produced by the software applications used in structure determination. Thus, it would be useful to harvest this information for deposition along with the coordinates and experimental data, as described earlier (Henrick, 1998; Winn, 1999) and implemented in the CCP4 package (Collaborative Computational Project, Number 4, 1994). Web-based protein crystallography project-information systems have also been developed that allow users to track the progress of a crystal structure determination (Haebel et al., 2001; Harris & Jones, 2002).
With the advent of high-throughput X-ray crystallography and the expected higher rate of data deposition, the RCSB PDB has developed various tools for automated and accurate structure deposition. ADIT is available both as a web-based tool and a standalone editor for assembling, editing, validating and depositing structural data. The application pdb_extract can extract information from the output of standard crystallographic programs at each step of the structure-determination process and merge the information into mmCIF files that are ready for validation and deposition. The PDB Validation Suite (Westbrook et al., 2003) creates structure-validation reports and calculates derived information that could be used for assessing the quality of a structure or for monitoring progress during refinement. Since these tools utilize the PDB mmCIF Exchange Dictionary for data exchange, their use in structure deposition also facilitates annotating and processing the data.
In addition to the coordinates and structure-factor files, information regarding the source and sequence of the macromolecules in the structure, data-collection, data-processing, structure-solution, refinement and citation information are also required for data deposition. Thus, the deposition process consists of collecting, assembling and entering all this information and finally submitting it to the PDB. It is highly recommended that the files be validated before submission. We have developed tools for all of these steps so as to make the deposition process as automatic as possible while ensuring the accuracy and integrity of the data.
Deposition tools include ADIT, pdb_extract and the PDB Validation Suite. These programs can be used either independently or in an integrated way, as shown in Fig. 2. Each of these tools is described in the sections below.
Figure 2. Ways to complete a PDB deposition using RCSB PDB tools. The flow chart in the dashed box is for the data extraction using pdb_extract. The black arrows (solid lines) show the recommended way for PDB deposition using pdb_extract and ADIT. The output files generated from each crystallographic step are extracted and merged into two mmCIF files which can be uploaded to ADIT for adding non-electronically captured information, validation and submission to the PDB. The red arrows (dotted lines) show an alternative method for PDB deposition using pdb_extract. The output mmCIF files generated by pdb_extract can be validated either from the command line or with the standalone version of ADIT. The validated mmCIF files can be directly sent to the PDB by ftp (ftp://pdb.rutgers.edu) or e-mail (firstname.lastname@example.org).
ADIT (http://deposit.pdb.org/adit ) is an integrated software system for assembling, editing, checking, validating and depositing structural data to the PDB. The functionality of this mmCIF editor has been described elsewhere (Berman et al., 2000). In an ADIT session, three operations can be performed: a data-format pre-check, validation and actual deposition. Optimally, all these steps should be executed for a structure deposition. In the data-format pre-check step, the format of the coordinate data file is checked to ensure that it conforms to either PDB or mmCIF format. In the validation step, the data are checked for consistency with known standards and a report is created as described below. If any major errors or warnings are highlighted here, the structure should be corrected accordingly before proceeding further. During deposition, all categories in ADIT should be completed appropriately and checked before submitting the structure to the PDB. Upon successful completion of a structure deposition, a PDB ID is automatically assigned to the entry and displayed in the deposition window.
In addition to the web-based version (http://deposit.pdb.org/adit ), there is a standalone Linux workstation version of ADIT that allows the user to prepare, check and validate structures on a local computer before actually submitting the file to the PDB. Except for actual deposition, this version contains all the functionalities of the web-based ADIT and can also be used to record information during different stages of structure solution or monitor the progress of a refinement. Once a file has been prepared and saved using the standalone version of ADIT, it can be uploaded and submitted via the web version.
The pdb_extract application automatically extracts information from output and log files generated by standard software used in X-ray crystallographic structure determination for data collection, data reduction, protein phasing, molecular replacement, density modification and refinement. Since multiple software packages may be used for each step of structure determination, pdb_extract has been designed to accommodate the researcher's preferences in software applications. Thus, this program can extract information from the output files of a number of commonly used applications at all stages of structure determination. Table 1 lists the software applications that are supported by pdb_extract and the information that is currently extracted. The PDB Exchange Dictionary contains definitions of all the data items that are extracted. Once the relevant information has been extracted from these files, it is merged to create two mmCIF data files: one with structure factors and the other with details of the structure including its coordinates.
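As an illustration of the merging step, the following Python sketch combines items harvested from two stages of structure determination into a single mmCIF-style data block. It is not the pdb_extract implementation: the function names merge_steps and to_mmcif and all numerical values are hypothetical, and only the data-item names are taken from the mmCIF dictionary. Values containing white space would need quoting, which is omitted here for brevity.

```python
def merge_steps(*steps):
    """Merge per-step dictionaries of mmCIF items into one dictionary.

    A conflict (same item, different value) raises an error instead of
    being silently overwritten, so that inconsistencies surface early.
    """
    merged = {}
    for step in steps:
        for item, value in step.items():
            if item in merged and merged[item] != value:
                raise ValueError(f"conflicting values for {item}")
            merged[item] = value
    return merged


def to_mmcif(block_name, items):
    """Write simple key-value items as an mmCIF data block (no loops, no quoting)."""
    width = max(len(name) for name in items)
    lines = [f"data_{block_name}"]
    for name in sorted(items):
        lines.append(f"{name.ljust(width)}  {items[name]}")
    return "\n".join(lines) + "\n"


# Hypothetical values standing in for what would be read from log files.
scaling = {
    "_reflns.d_resolution_high": "2.00",
    "_reflns.number_obs": "25000",
}
refinement = {
    "_refine.ls_d_res_high": "2.00",
    "_refine.ls_R_factor_R_work": "0.210",
    "_refine.ls_R_factor_R_free": "0.250",
}

merged = merge_steps(scaling, refinement)
text = to_mmcif("example", merged)

if __name__ == "__main__":
    print(text)
```

Raising an error on conflicting values is a design choice of this sketch; it reflects the aim of producing consistent files rather than any documented behaviour of pdb_extract.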
There are three versions of pdb_extract: a web interface (http://pdb-extract.rutgers.edu), a standalone application (available from RCSB PDB) and a version distributed as part of the CCP4 package (version 5.00 and above). In all cases, the program can extract the sequence information of all polymers (protein or nucleic acid) present in the structure from the coordinate file. The sequence should be examined and any residues which were present in the crystallized molecule(s) but not modeled owing to missing electron density should be inserted here. The identity of any residue modeled as Gly or Ala owing to missing side-chain density should also be corrected.
In the web version, the sequence information automatically populates the data item corresponding to the sequence of macromolecular components. These data eventually form the entity_poly_seq category in mmCIF format and the SEQRES record in PDB format files. When using the standalone version or pdb_extract as part of the CCP4 package, the program creates two text files while extracting the sequence information: a data-template file (called data_template.text) and a script-input file (called log_script.inp). The data-template file contains the sequence information and also has fields for adding non-electronically produced information such as author name, citation, release status, structure title, related entries, protein source, protein-expression details, molecular names, crystallization conditions, crystal properties, radiation source, temperature and data-collection protocols. These fields may either be completed here or later when using the ADIT editor for deposition. The advantage of completing these fields in the data-template file is obvious when preparing multiple related structures for deposition. In this case, many of these fields are identical. Thus, instead of manually typing all this information in ADIT for each deposition, the information in these fields can be copied when editing the respective data-template files. It should be noted here that the web interface of pdb_extract does not provide fields for including the non-electronically produced information. Thus, it should be used in conjunction with ADIT to produce fully populated data files that can be deposited to the PDB.
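To make the correspondence between the two sequence representations concrete, the following Python sketch writes the same residue list as PDB-format SEQRES records (13 residues per record) and as an mmCIF entity_poly_seq loop. The function names and the 15-residue sequence are hypothetical and serve only as an illustration; the sketch does not reproduce how pdb_extract itself generates these records.

```python
def seqres_records(chain_id, residues, per_line=13):
    """Format three-letter residue names as PDB SEQRES records.

    Columns: 1-6 'SEQRES', 8-10 record serial number, 12 chain ID,
    14-17 total number of residues, then residue names from column 20.
    """
    total = len(residues)
    records = []
    for start in range(0, total, per_line):
        serial = start // per_line + 1
        chunk = " ".join(f"{name:>3}" for name in residues[start:start + per_line])
        records.append(f"SEQRES {serial:3d} {chain_id} {total:4d}  {chunk}")
    return records


def entity_poly_seq_loop(entity_id, residues):
    """Format the same sequence as an mmCIF entity_poly_seq loop."""
    lines = [
        "loop_",
        "_entity_poly_seq.entity_id",
        "_entity_poly_seq.num",
        "_entity_poly_seq.mon_id",
    ]
    lines += [f"{entity_id} {num} {name}" for num, name in enumerate(residues, start=1)]
    return lines


# Hypothetical sequence, including residues that might be absent from the model.
sequence = ["MET", "SER", "GLY", "ALA", "LYS", "LEU", "VAL", "ASP",
            "GLU", "PHE", "THR", "TRP", "HIS", "ARG", "PRO"]

records = seqres_records("A", sequence)
loop = entity_poly_seq_loop(1, sequence)

if __name__ == "__main__":
    print("\n".join(records))
    print("\n".join(loop))
```

Because both outputs are generated from one residue list, correcting the sequence once (for example, restoring a side chain that was modeled as Ala) keeps the two representations consistent.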
In addition to the coordinates, structure-factor files and sequence information, pdb_extract requires the names of the programs used for structure determination, along with their appropriate output and log files representing the final or best trial for that step of structure determination. In the CCP4i and web interface of pdb_extract the program names can be selected from lists provided and the appropriate output and log files can be uploaded. The standalone version and pdb_extract as part of the CCP4 package can both be executed either from the command line or using a script. When using the script method, the script-input file (log_script.inp, generated along with the data-template file) is used to list the names of the applications used for structure solution along with their output and log files, while in the command-line method all this information is directly provided at the command line using specific arguments described in the documentation for this program.
In future, when information about protein production and crystallization is produced using computer-controlled equipment, pdb_extract will be extended to automatically harvest this information too.
The PDB Validation Suite (Westbrook et al., 2003; http://deposit.pdb.org/validate) creates reports based upon the following information: close contacts between all atoms both within the asymmetric unit and between symmetry-related molecules, covalent bond length and angle deviations (Clowney et al., 1996; Gelbin et al., 1996; Engh & Huber, 1991), chirality errors with respect to IUBMB and IUPAC conventions (Liébecq, 1992; Markley et al., 1998), ligand and atom nomenclature according to the chemical component dictionary (ftp://ftp.rcsb.org/pub/pdb/data/monomers/het_dictionary.txt), sequence comparison and water distances. The reports produced by the PDB Validation Suite are output in a plain text file and in PostScript files showing both the asymmetric unit and crystal packing. Presently, validation reports from SFCHECK (Vaguine et al., 1999), PROCHECK (Laskowski et al., 1993) and NUCHECK (Feng et al., 1998) are also produced. Reports based on other validation programs like WHAT_CHECK (Hooft et al., 1996) and MolProbity (Lovell et al., 2003) may be included here in future versions.
Using the tools described here can make deposition of structural data produced by X-ray diffraction experiments quick, easy, automated, complete and error-free. Fig. 2 illustrates the different ways to complete a PDB deposition using these tools developed by the RCSB.
The method used until now has been to upload coordinate and structure-factor files into ADIT. After (optional) validation of the files, any information not available in the uploaded files is manually typed into ADIT by the depositor; the data file is then ready for deposition. An improved method is to use pdb_extract to automatically retrieve data from the output and log files of structure-determination programs for deposition into the PDB. Data extraction can be performed either using the pdb_extract web interface (shown by the black arrows in Fig. 2) or using the standalone version (shown by the red arrows in Fig. 2). Additionally, pdb_extract may also be used as part of the CCP4 package. The two mmCIF files produced by all these methods can be imported to the ADIT web interface for online validation and submission. Alternatively, after validation the user can send the two mmCIF files produced by pdb_extract to the PDB using ftp (ftp://pdb.rutgers.edu ) or e-mail (email@example.com). The PDB ID for such an e-mail or ftp deposition is usually assigned within the next working day.
The advantage of using pdb_extract is that it reduces manual editing and the data are less likely to contain errors and inconsistencies. It also allows the depositor to easily capture detailed information regarding the structure determination, which leads to a more complete deposition. Optimal use of the data-template file is helpful in efficiently preparing multiple related depositions. Since the files deposited are created using software based upon the PDB Exchange Dictionary, the annotation process is also easier and takes much less time.
Here, we use an example to discuss a few ways of using pdb_extract to deposit a set of coordinates and structure-factor data into the PDB. In the example, a single crystal was used to collect data at three wavelengths (e.g. inflection, peak, remote edge) for a multiple anomalous diffraction (MAD; Hendrickson, 1991) experiment. The data were indexed and scaled using HKL2000 (Otwinowski & Minor, 1997). All three reflection-data files were used for phase determination and phase refinement using SOLVE (Terwilliger & Berendzen, 1999) followed by density modification using RESOLVE (Terwilliger, 2000). The final structure refinement was performed using the reflection data collected at the inflection edge (infl.cv) by CNS (Brünger et al., 1998).
The relevant output and log files generated by each of the programs used in this example include three reflection data files (scalepack1.sca, scalepack2.sca, scalepack3.sca) and three log files (scalepack1.log, scalepack2.log, scalepack3.log) generated by HKL2000, one log file (solve.prt) containing phasing statistics and a PDB file (ha.pdb) containing heavy-atom coordinates (Se in this case) generated by SOLVE, one log file (resolve.log) containing statistics after density modification by RESOLVE and one mmCIF file (cns.cif) containing atomic coordinates and refinement statistics generated by CNS.
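The files listed above can be organized as a simple manifest that maps each program to its output and log files, which makes it easy to confirm that nothing is missing before extraction begins. The Python sketch below is only an illustration: the names MANIFEST and missing_files are hypothetical, and the manifest is not the format of the log_script.inp file used by pdb_extract.

```python
from pathlib import Path

# Program -> output and log files, as listed for the MAD example.
# The structure-factor file used for refinement (infl.cv) is handled separately.
MANIFEST = {
    "HKL2000": [
        "scalepack1.sca", "scalepack2.sca", "scalepack3.sca",
        "scalepack1.log", "scalepack2.log", "scalepack3.log",
    ],
    "SOLVE": ["solve.prt", "ha.pdb"],
    "RESOLVE": ["resolve.log"],
    "CNS": ["cns.cif"],
}


def missing_files(manifest, exists=lambda name: Path(name).is_file()):
    """Return {program: [absent files]} for every program with absent files.

    'exists' is a predicate on the file name; by default it checks the
    current working directory.
    """
    missing = {}
    for program, files in manifest.items():
        absent = [name for name in files if not exists(name)]
        if absent:
            missing[program] = absent
    return missing


if __name__ == "__main__":
    for program, files in missing_files(MANIFEST).items():
        print(f"{program}: missing {', '.join(files)}")
```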
The first step is to upload the coordinate file (cns.cif) into the pdb_extract web interface. This automatically extracts the sequences of all macromolecules present in the structure. Here, the web-interface window is split into two frames. The top frame is used for collecting information about the structure-factor file(s) and statistics related to data processing, while the bottom frame is for the experimental details and coordinates. Names of the applications are selected and the appropriate output and log files are added to both frames. Thus, the structure factors for final refinement (infl.cv) and the reflection data files used for phasing (scalepack1.sca, scalepack2.sca, scalepack3.sca) along with their corresponding log files (scalepack1.log, scalepack2.log, scalepack3.log) are uploaded in the top frame. The program names and files uploaded in the bottom frame include SOLVE (with solve.prt, ha.pdb), RESOLVE (with resolve.log) and CNS (with cns.cif). The sequence information displayed should be corrected or completed as necessary. Clicking the submit buttons in both frames produces two mmCIF files: one containing the structure factors and the other containing details of the structure including the coordinates. Fig. 3 shows the data-item correspondence between the merged mmCIF and the header section of the PDB file.
Figure 3. An example of the correspondence between mmCIF file and PDB formatted data. The left, middle and right columns show the crystallographic data names, the data in mmCIF format and the corresponding PDB formatted data, respectively.
The mmCIF files produced here should be uploaded into the ADIT web interface for validation. The user can then manually add the non-electronically captured information, such as the author names, citation information and deposition status, and submit the completed file to the PDB.
The first step here is to create a data-template file (called data_template.text) and a script-input file (called log_script.inp) from the final coordinate file using the command extract -cif cns.cif. The protein sequence extracted from the coordinate file is written to the data_template.text file. It should be examined and corrected as necessary, and non-electronically produced information such as author names and citation information can be included here. The log_script.inp file is then edited to enter the names of the applications used (HKL2000, SOLVE, RESOLVE and CNS) and the appropriate log and output files (as described above). The name of the data-template file (data_template.text) is also included in the log_script.inp file. After completing both these files, the program is run using the command extract -ext log_script.inp. This produces two files which are similar to those generated by the pdb_extract web interface (see §3.1). However, if any non-electronically generated information was included in data_template.text, it is carried over to the output mmCIF file.
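For laboratories that prepare many depositions, the two command-line invocations quoted above can be wrapped in a small script. The Python sketch below builds the commands exactly as given in the text and runs them only on request; the function names are hypothetical, and the sketch assumes that the pdb_extract executable is installed under the name extract and is found on the search path. The manual editing of data_template.text and log_script.inp still has to take place between the two steps.

```python
import shutil
import subprocess


def template_command(coordinate_file):
    """Command that creates data_template.text and log_script.inp."""
    return ["extract", "-cif", coordinate_file]


def extraction_command(script_file="log_script.inp"):
    """Command that runs the extraction described by the edited script file."""
    return ["extract", "-ext", script_file]


def run_step(command):
    """Run one step, failing clearly if the executable cannot be found."""
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} not found on PATH")
    # check=True raises CalledProcessError if the program reports failure.
    return subprocess.run(command, check=True)


steps = [template_command("cns.cif"), extraction_command()]

if __name__ == "__main__":
    # Only print the commands here; call run_step(...) explicitly for each
    # step once the template and script files have been edited by hand.
    for command in steps:
        print(" ".join(command))
```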
The two mmCIF files produced by pdb_extract may be uploaded into a standalone workstation version of ADIT for validation. Alternatively, if the standalone version of pdb_extract has been installed, the mmCIF file containing structural details and coordinates can be validated using the command validation-v8 -f example.mmCIF -o 2 -public -exchange -adit. In both cases a validation report is generated, which should be carefully examined. Any errors reported here should be corrected before final deposition using ADIT. The validated mmCIF files can also be submitted to the PDB via ftp or e-mail.
Procedures and tools have been developed by the RCSB PDB to facilitate data deposition to the PDB archives. Information about structure determination can be automatically extracted from many crystallographic applications and merged into mmCIF format files ready for validation and deposition. This process produces more complete and reliable files, while reducing the human effort involved in data deposition and data processing. The procedures described here lend themselves to single-structure depositions as well as to high-throughput depositions of multiple structures.
The source code for pdb_extract, the PDB Validation Suite and ADIT is available under an Open Source license. Table 2 shows the location of the software, documentation and web servers described here. pdb_extract is also found in the CCP4 package (version 5.0 and above).
The RCSB Protein Data Bank (RCSB PDB) is operated by Rutgers, The State University of New Jersey, the San Diego Supercomputer Center at the University of California, San Diego (SDSC/UCSD) and the Center for Advanced Research in Biotechnology (CARB)/UMBI/NIST - three members of the Research Collaboratory for Structural Bioinformatics (RCSB). The RCSB PDB is supported by funds from the National Science Foundation, the National Institute of General Medical Sciences, the Office of Science, Department of Energy, the National Library of Medicine, the National Cancer Institute and the National Center for Research Resources, the National Institute of Biomedical Imaging and Bioengineering and the National Institute of Neurological Disorders and Stroke. RCSB PDB is a member of the wwPDB.
Berman, H. M., Henrick, K. & Nakamura, H. (2003). Nature Struct. Biol. 10, 980.
Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N. & Bourne, P. E. (2000). Nucleic Acids Res. 28, 235-242.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F. Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535-542.
Bourne, P. E., Berman, H. M., Watenpaugh, K., Westbrook, J. D. & Fitzgerald, P. M. D. (1997). Methods Enzymol. 277, 571-590.
Brünger, A. T., Adams, P. D., Clore, G. M., DeLano, W. L., Gros, P., Grosse-Kunstleve, R. W., Jiang, J.-S., Kuszewski, J., Nilges, M., Pannu, N. S., Read, R. J., Rice, L. M., Simonson, T. & Warren, G. L. (1998). Acta Cryst. D54, 905-921.
Burley, S. K., Almo, S. C., Bonanno, J. B., Capel, M., Chance, M. R., Gaasterland, T., Lin, D., Sali, A., Studier, F. W. & Swaminathan, S. (1999). Nature Genet. 23, 151-157.
Clowney, L., Jain, S. C., Srinivasan, A. R., Westbrook, J., Olson, W. K. & Berman, H. M. (1996). J. Am. Chem. Soc. 118, 509-518.
Collaborative Computational Project, Number 4 (1994). Acta Cryst. D50, 760-763.
Cowtan, K. (1994). Jnt CCP4/ESF-EACBM Newsl. Protein Crystallogr. 31, 34-38.
Engh, R. A. & Huber, R. (1991). Acta Cryst. A47, 392-400.
Evans, P. R. (1997). Jnt CCP4/ESF-EACBM Newsl. Protein Crystallogr. 33, 22-24.
Feng, Z., Westbrook, J. & Berman, H. M. (1998). Report NDB-407. Rutgers University, New Brunswick, NJ, USA.
Furey, W. & Swaminathan, S. (1997). Methods Enzymol. 277, 590-620.
Gelbin, A., Schneider, B., Clowney, L., Hsieh, S.-H., Olson, W. K. & Berman, H. M. (1996). J. Am. Chem. Soc. 118, 519-528.
Haebel, P. W., Arcus, V. L., Baker, E. N. & Metcalf, P. (2001). Acta Cryst. D57, 1341-1343.
Harris, M. & Jones, T. A. (2002). Acta Cryst. D58, 1889-1891.
Hendrickson, W. A. (1991). Science, 254, 51-58.
Henrick, K. (1998). CCP4 Newsl. Protein Crystallogr. 35, 13-16.
Hooft, R. W. W., Vriend, G., Sander, C. & Abola, E. E. (1996). Nature (London), 381, 272.
Kissinger, C. R., Gehlhaar, D. K. & Fogel, D. B. (1999). Acta Cryst. D55, 484-491.
La Fortelle, E. de & Bricogne, G. (1997). Methods Enzymol. 276, 472-494.
Laskowski, R. A., MacArthur, M. W., Moss, D. S. & Thornton, J. M. (1993). J. Appl. Cryst. 26, 283-291.
Liébecq, C. (1992). Editor. Biochemical nomenclature and related documents: a compendium prepared for the Committee of Editors of Biochemical Journals, 2nd ed. North Carolina: Portland Press.
Lovell, S. C., Davis, I. W., Arendall, W. B. III, de Bakker, P. I. W., Word, J. M., Prisant, M. G., Richardson, J. S. & Richardson, D. C. (2003). Proteins, 50, 437-450.
Markley, J. L., Bax, A., Arata, Y., Hilbers, C. W., Kaptein, R., Sykes, B. D., Wright, P. E. & Wüthrich, K. (1998). J. Biomol. NMR, 12, 1-23.
Murshudov, G. N., Vagin, A. A., Lebedev, A., Wilson, K. S. & Dodson, E. J. (1999). Acta Cryst. D55, 247-255.
Navaza, J. (1994). Acta Cryst. A50, 157-163.
Otwinowski, Z. & Minor, W. (1997). Methods Enzymol. 276, 307-326.
Pflugrath, J. W. (1999). Acta Cryst. D55, 1718-1725.
Sheldrick, G. & Schneider, T. (1997). Methods Enzymol. 277, 319-343.
Terwilliger, T. C. (2000). Acta Cryst. D56, 965-972.
Terwilliger, T. C. & Berendzen, J. (1999). Acta Cryst. D55, 849-861.
Tronrud, D. E. (1997). Methods Enzymol. 277, 306-319.
Vagin, A. & Teplyakov, A. (2000). Acta Cryst. D56, 1622-1624.
Vaguine, A. A., Richelle, J. & Wodak, S. J. (1999). Acta Cryst. D55, 191-205.
Weeks, C. M., Blessing, R. H., Miller, R., Mungee, S., Potter, S. A., Rappleye, A., Smith, G. D., Xu, H. & Furey, W. (2002). Z. Kristallogr. 217, 686-693.
Weeks, C. M. & Miller, R. (1999). Acta Cryst. D55, 492-500.
Westbrook, J., Feng, Z., Burkhardt, K. & Berman, H. M. (2003). Methods Enzymol. 374, 370-385.
Winn, M. (1999). CCP4 Newsl. Protein Crystallogr. 37.