November 1998 issue
Databases for macromolecular crystallography
Proceedings of the CCP4 study weekend, January 1998
The rapid growth of the World Wide Web provides major new opportunities for distributed databases, especially in macromolecular science. A new generation of technology, based on structured documents (SD) and XML, is being developed which will integrate documents and data in a seamless manner.
The Protein Data Bank (PDB) at Brookhaven National Laboratory is described. It is a database of experimentally determined three-dimensional structures of proteins, nucleic acids and other biological macromolecules, with approximately 8000 entries.
A summary is given of the macromolecular structure databases developed to date. The authors' own work indicates that data are reported inconsistently, a problem that should be addressed in the future.
A description is given of how the Nucleic Acid Database (NDB) is used to study nucleic acids. In addition, the way in which the technology developed by the NDB project has been extended to macromolecules in general is summarized.
The importance of validation techniques in X-ray structure determination and their relation to refinement procedures are discussed, with particular reference to atomic resolution structures. The requirements of deposition and publication, and the role of validation tools in meeting them, are analysed. The need for a rigorously defined file format is emphasized.
A description of new analytical software tools and WWW servers for studying protein sequence, structure and function is presented.
The Structural Classification of Proteins (SCOP) database is described. It provides a detailed and comprehensive description of the relationships of all known protein structures and can be used as a source of data to calibrate sequence search algorithms and for the generation of population statistics on protein structures.
The CATH database of protein domain structures classifies structures according to their (C)lass, (A)rchitecture, (T)opology or fold and (H)omologous family. Although the protocol used is mostly automatic, manual inspection is used to check assignments at some critical stages. Described in this article is a recently established facility to search the database with the coordinates of a newly determined structure.
Databases of protein structural domains (DDBASE), aligned homologous protein structures (HOMSTRAD) and structurally aligned protein superfamilies (CAMPASS) are available on the WWW.
Analysis of data from the IsoStar library shows that many hydrophobic groups exhibit strikingly strong directional preferences in their intermolecular interactions. These directional preferences may need to be taken into account in parameterizing the next generation of protein–ligand docking programs.
The reliability and transferability of M—L bond lengths and L—M—L bond angles from crystal structures are considered in the light of the utility of tables of 'typical' bond lengths in transition-metal complexes.
The Heavy-Atom Data Bank (HAD) is described. It contains coordinates of heavy-atom sites derived from multiple isomorphous derivatives used in protein crystallography, together with information on crystallization conditions and protein binding sites that will be of value in preparing heavy-atom derivatives for the method of isomorphous replacement.