The Nucleic Acid Database: A Resource for Nucleic Acid Science
NDB, Department of Chemistry, Rutgers University, 610 Taylor Road, Piscataway, NJ 08854-8087, USA
*Correspondence e-mail: berman@adenine.rutgers.edu
The Nucleic Acid Database (NDB) distributes information about nucleic acid-containing structures. Here the information content of the database as well as the query capabilities are described. A summary of how the technology developed by this project has been used to develop other macromolecular databases is given.
1. Introduction
Nucleic acids have distinctive chemical and structural features. There are 12 conformation angles which describe the bonds interconnecting the base, sugars and phosphates, and 16 different parameters which describe the geometry of the base pairs. This potential conformational variability allows double-stranded DNA to be flexible enough to be coiled in the cell and single-stranded RNA to fold into compact functional units. Single-crystal nucleic acid crystallography began in the 1970s, almost 20 years after the discovery of the double-helical form of DNA by Watson and Crick (Watson & Crick, 1953). By 1990, there were more than 150 structures of DNA, RNA and t-RNA, and a handful of protein–nucleic acid complexes. The sample set was thus large enough to begin to ask questions about the effects of sequence and environment on the structures of these biological molecules. The vision behind the creation of the Nucleic Acid Database (NDB; Berman et al., 1992) was to establish a resource which enables researchers to easily answer these questions and to facilitate, in general, research about nucleic acid structures. This resource would be realised through the creation of a database of information about nucleic acid structures which would have robust query capabilities. Thus, not only would users be able to extract coordinates for a particular structure, but they would also be able to select groups of structures based on particular characteristics and extract the experimental information and derived features of these structures. In addition, it was desired to be able to make correlations among different characteristics of the structures so that eventually the database would become a predictive tool.
At the time that the NDB was established, nucleic acid structural information was found in both the Cambridge Structural Database (CSD; Allen et al., 1979) and the Protein Data Bank (PDB; Bernstein et al., 1977). The CSD, which contained the smaller structures, including dinucleoside phosphates, had a very robust query engine but did not provide the specialized information required to understand nucleic acid structures. The PDB contained the larger structures but lacked a query interface; although it now provides a browser, it is not possible to extract derived data about groups of structures. The original scope of the NDB project was to create a `value-added' database containing structural data for both RNA and DNA. The NDB is now the direct deposition site for these structures and has created a database with information about all nucleic acid-containing crystals. The ways in which the NDB is used to support research on nucleic acids are described here. Additionally, we describe how we have applied the technology developed by the NDB to other types of macromolecular databases.
2. Information content of the NDB
In addition to coordinate data, the NDB contains information about the experiments used to determine the structures, including crystallization and data-collection information.
The database also stores derived information such as valence geometry, torsion angles, base morphology parameters and intermolecular contacts. Further annotation includes information about the overall structural features. These include the conformational classes, special structural features, biological functions, and crystal packing classifications. Table 1 summarizes these derived data and annotations. The organization of these features into a database makes it possible to gain new insights about structure using advanced techniques such as data mining.
3. Validation procedures for nucleic acids
Structures are added to the database via two routes: direct deposition (nucleic acids) and post-processing of PDB files (protein–nucleic acid complexes). All data are transformed into mmCIF files, which allows them to be checked automatically against the mmCIF dictionary (Bourne et al., 1997). This creates a uniform archive to ensure reliable query results. Structure checking is accomplished using a suite of programs that verify features including valence geometry, torsion angles and intermolecular contacts. The dictionaries used for checking the structures were developed from analyses of high-resolution small-molecule structures from the CSD (Clowney et al., 1996; Gelbin et al., 1996). The torsion-angle ranges were derived from an analysis of high-resolution nucleic acid structures (Schneider et al., 1997). One important outgrowth of this work was the creation of the parameters and restraints that are now in common use for crystallographic refinement of nucleic acid structures (Parkinson et al., 1996).
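To make the torsion-angle part of this checking concrete, the following is a minimal sketch and not the NDB validation software. It computes a torsion angle from four atomic positions and tests it against an allowed range. The function names `torsion_angle` and `in_range` are hypothetical, and the numerical range used in the example call is an arbitrary placeholder rather than a value taken from Schneider et al. (1997).

```python
import math


def torsion_angle(p1, p2, p3, p4):
    """Torsion angle in degrees (-180 to 180) defined by four 3D points."""
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    def dot(a, b):
        return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

    b1, b2, b3 = sub(p2, p1), sub(p3, p2), sub(p4, p3)
    n1 = cross(b1, b2)  # normal to the plane through p1, p2, p3
    n2 = cross(b2, b3)  # normal to the plane through p2, p3, p4
    x = dot(n1, n2)
    y = dot(cross(n1, n2), b2) / math.sqrt(dot(b2, b2))
    return math.degrees(math.atan2(y, x))


def in_range(angle, low, high):
    """True if angle lies in [low, high] on the 0-360 circle.

    The range may wrap through 0 (for example 350 to 20 degrees).
    """
    a, low, high = angle % 360.0, low % 360.0, high % 360.0
    if low <= high:
        return low <= a <= high
    return a >= low or a <= high


# Toy coordinates, chosen only so that the result is easy to verify by eye.
angle = torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
# Placeholder range (60 to 120 degrees); real ranges depend on the angle
# type and on the conformational class of the structure.
print(round(angle, 1), in_range(angle, 60, 120))
```

A checking program built along these lines would loop over every residue, compute each backbone torsion angle and flag those falling outside the range expected for the assigned conformational class.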
4. The characteristics and uses of the database
4.1. Basic features
The core of the NDB project (Fig. 1) is a relational database in which all of the primary and derived data items are organized into tables. At present there are over 90 tables in the NDB, with each containing five to 20 items. These tables contain both experimental and derived information. Example tables include: the citation table, which contains all the items that are contained in literature references; the cell_dimension table, which contains all items related to crystal data; and the refine_parameters table, which contains the items that describe the refinement. Interaction with the database is a two-step process (Fig. 2). In the first step, the user defines the selection criteria by combining different database items. As an example, one could select all B-DNA structures that have a resolution better than 2.0 Å and an R factor better than 0.17, and that were determined by Dickerson, Kennard or Rich. The logic for this constraint is shown in Table 2.
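The selection step can be illustrated with a small, self-contained sketch using Python's built-in `sqlite3` module. This is an illustration only: the real NDB schema is not reproduced here. Of the names below only `refine_parameters` appears in the text; the other table names, all column names and all rows (`TOY1` to `TOY6`) are invented for the example. "Better than" is read as numerically smaller for both resolution and R factor.

```python
import sqlite3

# Hypothetical miniature schema. Only the table name refine_parameters is
# taken from the text; every other name and every row is an assumption.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE structure_summary (structure_id TEXT PRIMARY KEY, conformation TEXT);
CREATE TABLE refine_parameters (structure_id TEXT, resolution REAL, r_factor REAL);
CREATE TABLE citation_authors  (structure_id TEXT, author TEXT);
""")
conn.executemany("INSERT INTO structure_summary VALUES (?, ?)", [
    ("TOY1", "B"), ("TOY2", "B"), ("TOY3", "A"),
    ("TOY4", "B"), ("TOY5", "B"), ("TOY6", "B"),
])
conn.executemany("INSERT INTO refine_parameters VALUES (?, ?, ?)", [
    ("TOY1", 1.5, 0.15), ("TOY2", 2.5, 0.15), ("TOY3", 1.4, 0.14),
    ("TOY4", 1.8, 0.16), ("TOY5", 1.9, 0.20), ("TOY6", 1.2, 0.12),
])
conn.executemany("INSERT INTO citation_authors VALUES (?, ?)", [
    ("TOY1", "Dickerson"), ("TOY2", "Kennard"), ("TOY3", "Rich"),
    ("TOY4", "Smith"), ("TOY4", "Rich"), ("TOY5", "Dickerson"),
    ("TOY6", "Smith"),
])

# Constraint: B conformation AND resolution < 2.0 AND R factor < 0.17
# AND (author is Dickerson OR Kennard OR Rich).
SELECTION_SQL = """
SELECT DISTINCT s.structure_id
FROM structure_summary AS s
JOIN refine_parameters AS r ON r.structure_id = s.structure_id
JOIN citation_authors  AS a ON a.structure_id = s.structure_id
WHERE s.conformation = 'B'
  AND r.resolution < 2.0
  AND r.r_factor < 0.17
  AND a.author IN (?, ?, ?)
ORDER BY s.structure_id
"""
selected = [row[0] for row in
            conn.execute(SELECTION_SQL, ("Dickerson", "Kennard", "Rich"))]
print(selected)
```

The conditions on different tables are combined with AND, while the alternative authors are combined with OR (written here as `IN`), which is one plausible reading of the constraint described in the text.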
Once the structures that meet the constraint criteria have been selected, reports may be written using a combination of table items. For any set of chosen structures, a large variety of reports may be created. For the example given above, a crystal data report or a backbone torsion-angle report can be generated easily. Or the user could write a report that lists the twist values for all CG steps together with statistics, including mean, median and range of values. The constraints used for the reports do not have to be the same as those used to select the structures.
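As a rough sketch of the report step, the snippet below summarizes twist values for one base step over a set of already selected structures. The function name `twist_report`, the row layout and the numerical values are all hypothetical; they stand in for items that would be read from the database tables.

```python
from statistics import mean, median


def twist_report(rows, step="CG"):
    """Summarize twist values (degrees) for one base step.

    rows: iterable of (structure_id, step, twist) tuples.
    Returns None when no row matches the requested step.
    """
    values = [twist for _sid, s, twist in rows if s == step]
    if not values:
        return None
    return {
        "n": len(values),
        "mean": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
    }


# Invented rows for two previously selected toy structures.
rows = [
    ("TOY1", "CG", 30.0),
    ("TOY1", "GC", 40.0),
    ("TOY4", "CG", 36.0),
    ("TOY4", "CG", 33.0),
]
report = twist_report(rows, step="CG")
print(report)
```

Here the restriction to CG steps belongs to the report, not to the structure selection, which mirrors the point that report constraints need not match the selection constraints.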
The NDB can also be used to create graphical reports. It is possible to produce pie charts, scattergrams and histograms describing any aspect of a selected group of structures. These capabilities were put to full use in deriving the ranges of torsion angles for different types of DNA helices (Schneider et al., 1997). Fig. 3 shows some examples of the variety of reports one can make about torsion angles.
Another very popular and useful report is the NDB Atlas report page. Atlas pages are created directly from the NDB database for a particular structure and contain summary, crystallographic and experimental information about the structure, a molecular view of the biological unit and a crystal packing picture (Fig. 4). The Atlas section of the NDB WWW site contains Atlas entries for all of the structures in the database organized by structure type.
4.2. Query capabilities
4.2.1. Character menu
The most direct access to the database is through the use of SQL commands. However, given that this is a specialist language, a character menu has been constructed which allows access to all of the tables in the database. Queries and reports are constructed menu by menu. Both query and report constraints can be saved in a `command' file which allows the user to access, revise and edit these constraint and report definitions at any time. These command files facilitate making multiple reports about a particular set of structures quickly and easily. This method is used to generate the summary reports produced regularly for the NDB WWW Archives.
4.2.2. WWW interface
The WWW interface was designed to make the query capabilities of the NDB as widely accessible as possible. To highlight the special features of NDB, the interface operates in two modes. In the Quick Search/Quick Report mode, there are several items, including structure ID, author, classification and special features, which can be limited either by entering text in a box or by selecting an option from the pull-down menu. Any combination of these items may be used to constrain the structure selection. If none are used, the entire database will be selected. After selecting `Execute Selection', the user will be presented with a list of IDs and descriptors of the structures that match the desired conditions. Several viewing options for each structure in this list are possible. These include retrieving the coordinate files in either mmCIF or PDB format, retrieving the coordinates for the biological unit, viewing the structure with RasMol (Sayle & Milner-White, 1995), or viewing an NDB Atlas page.
Preformatted Quick Reports can then be generated for the structures in this results list. The user selects a report from a list of options, and the report is created automatically. For the structures in the NDB, there are 12 different types of Quick Reports, as shown in Table 3. They are particularly convenient for quickly summarizing derived features such as torsion angles and base morphology.
In the Full Search/Full Report mode, it is possible to access most of the tables in the NDB and create more complex queries. Instead of selecting from a limited number of options, the user builds a search by selecting the tables, and then the items, that contain the desired features. The search conditions can be combined using Boolean and logical operators. An example Full Search, finding transcription repressors that bind to DNA containing the step T G, is shown in Fig. 5.
After selecting structures using Full Search, a variety of reports can be written. The user selects the items that are to be displayed in a report by going through the tables that include the desired information. Multiple reports can be generated for the same group of selected structures; one report could list the DNA sequences, and another could present the base morphology of the bases in these sequences. Shortly, it will be possible to draw scattergrams and histograms interactively using Java applets to analyze the structural features dynamically.
Another variation on this query would be to select the protein–DNA complexes which contain a particular sequence pattern, e.g. ACA, and then write a report which gives the binding mode of these structures. A report showing the binding mode demonstrates that most of these structures are regulatory. To further refine the report, the user can also include the type of regulatory protein. The report that is produced is shown in Fig. 6.
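The logic of such a sequence-pattern query followed by a binding-mode report might be sketched as follows. The entry layout, the function names and the toy data are assumptions made for illustration; in particular, the binding-mode labels are invented and the search only inspects strands as written, without considering complementary strands.

```python
from collections import Counter


def select_by_pattern(entries, pattern):
    """Keep entries in which at least one strand contains the pattern."""
    return [e for e in entries
            if any(pattern in strand for strand in e["strands"])]


def binding_mode_report(entries):
    """Count how many selected entries fall into each binding mode."""
    return Counter(e["binding_mode"] for e in entries)


# Invented entries; the IDs, sequences and labels are not NDB data.
entries = [
    {"id": "TOY1", "strands": ["GTACAGT"], "binding_mode": "regulatory"},
    {"id": "TOY2", "strands": ["CGCGAATTCGCG"], "binding_mode": "enzyme"},
    {"id": "TOY3", "strands": ["TTGACATT", "AATGTCAA"],
     "binding_mode": "regulatory"},
    {"id": "TOY4", "strands": ["AACAGG"], "binding_mode": "enzyme"},
]

selected = select_by_pattern(entries, "ACA")
report = binding_mode_report(selected)
print([e["id"] for e in selected], dict(report))
```

Refining the report by the type of regulatory protein would amount to counting on a second key in the same way.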
Experimental features may be explored by constructing a query to search for structures with cell dimensions within a particular range. It is also possible to search for some aspects of crystallization conditions, although the information collected on crystallization by the NDB is more limited than that found at the Biological Macromolecule Crystallization Database (BMCD) (Gilliland et al., 1994).
5. NDB Archives
The NDB Archives contain a large variety of information and tables useful for researchers. These include a variety of prepared reports that are sorted according to structure type (Fig. 7). The dictionaries of standard geometries of nucleic acids are here, as well as parameter files for X-PLOR (Brünger, 1992). The ftp server provides coordinates for the asymmetric and biological units in PDB and mmCIF formats, together with further data files and coordinates for nucleic acid structures determined by NMR.
6. Other applications of NDB technology
The underlying technology of the NDB has also been used to create relational databases for other classes of macromolecules. The WWW interface, called Structure Finder, is designed so that specific tables of information can be turned on or off as appropriate for a particular structural class. The decision to turn a table on or off depends on two things: the quality of the underlying data and the appropriateness of that table to the structure class in the database. Four databases created by the NDB Project are described here and can be found under Structure Finder at the NDB Biological Structure Resource (http://ndbserver.rutgers.edu/). Tutorials for all of the Structure Finder databases are also available at this site.
6.1. Proteins Plus
This database is created from all the files in the current PDB and is updated regularly. MAXIT (Feng et al., 1997), a software tool developed by the NDB, extracts data from PDB files which are loaded into the database tables. A drawback of this database is that the PDB format has changed over the years so that abstracting information from these files is not entirely reliable. To remedy this, the NDB and the Macromolecular Structure Database (MSD) group at EBI are working to put the PDB files into a uniform format and add any missing data items. In the cases in which files have been recurated, the newly processed file is available from Proteins Plus. Once all the files have been remediated, many additional tables will be available for searching.
The Proteins Plus database can be searched using Quick Search/Quick Reports and Full Search/Full Reports, which are used in the same way as for the NDB. An example of a Proteins Plus query and report session might be to search for all myoglobin structures and then to create a report of all crystal data or another which lists the positions of the helices (Fig. 8).
6.2. DNA binding proteins
All proteins which bind to DNA have been fully curated and annotated and placed into a database. The database includes protein–DNA complexes as well as proteins that bind to DNA but do not have DNA in the crystal. The functions of these proteins are available as both search targets and report content.
7. Summary
The NDB Project has evolved over the years. The original NDB contained fewer than 100 crystal structures; there are now over 700. In addition, the project has created a suite of speciality databases and technologies that will allow for the evolution of these databases. The strength of the search engines developed by this project is that they allow for rapid selection of structures by a wide variety of criteria. Once selected, the coordinates of these structures, along with the tabular and graphical report capabilities of the NDB, can be used to understand a large number of the characteristics of these structures. The curated and reliable files that are provided by the NDB can also be used by other independent programs.
8. Access
The home for the Nucleic Acid Database can be found on the NDB Biological Structure Resource Home page (http://ndbserver.rutgers.edu/). On this page are pointers to the NDB and Structure Finder, as well as to à la mode (Clowney & Westbrook, 1997), which is a database of ligands and monomer units, and to the mmCIF WWW site. In addition to backup mirror sites at Rutgers University, the NDB is mirrored at the European Bioinformatics Institute (Europe), NIBH-AIST (Japan), and San Diego Supercomputer Center (US), with additional public mirrors currently in development. These mirrors are kept synchronous by using software tools developed by the project.
Acknowledgements
This work has been funded by the National Science Foundation (BIR 95 10703).
References
Allen, F. H., Bellard, S., Brice, M. D., Cartright, B. A., Doubleday, A., Higgs, H., Hummelink, T., Hummelink-Peters, B. G., Kennard, O., Motherwell, W. D. S., Rodgers, J. R. & Watson, D. G. (1979). Acta Cryst. B35, 2331–2339.
Berman, H. M., Olson, W. K., Beveridge, D. L., Westbrook, J., Gelbin, A., Demeny, T., Hsieh, S.-H., Srinivasan, A. R. & Schneider, B. (1992). Biophys. J. 63, 751–759.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535–542.
Bourne, P., Berman, H. M., Watenpaugh, K., Westbrook, J. D. & Fitzgerald, P. M. D. (1997). Methods Enzymol. 277, 571–590.
Brünger, A. T. (1992). X-PLOR, Version 3.1, A System for X-ray Crystallography and NMR, Yale University Press, New Haven, CT, USA.
Clowney, L., Jain, S. C., Srinivasan, A. R., Westbrook, J., Olson, W. K. & Berman, H. M. (1996). J. Am. Chem. Soc. 118, 509–518.
Clowney, L. & Westbrook, J. D. (1997). à la mode: A Ligand and Monomer Object Data Environment, NDB-241, Rutgers University, New Brunswick, NJ, USA.
Feng, Z., Hsieh, S.-H., Gelbin, A. & Westbrook, J. (1997). MAXIT: Macromolecular Exchange and Input Tool, NDB-220, Rutgers University, New Brunswick, NJ, USA.
Gelbin, A., Schneider, B., Clowney, L., Hsieh, S.-H., Olson, W. K. & Berman, H. M. (1996). J. Am. Chem. Soc. 118, 519–528.
Gilliland, G. L., Tung, M., Blakeslee, D. M. & Ladner, J. E. (1994). Acta Cryst. D50, 408–413.
Grzeskowiak, K., Yanagi, K., Privé, G. G. & Dickerson, R. E. (1991). J. Biol. Chem. 266, 8861–8883.
Lavery, R. & Sklenar, H. (1988). J. Biomol. Struct. Dyn. 6, 63–91.
Parkinson, G., Vojtechovsky, J., Clowney, L., Brünger, A. T. & Berman, H. M. (1996). Acta Cryst. D52, 57–64.
Sayle, R. & Milner-White, J. E. (1995). Trends Biochem. Sci. 20, 374.
Schneider, B., Neidle, S. & Berman, H. M. (1997). Biopolymers, 42, 113–124.
Scott, W. G., Finch, J. T. & Klug, A. (1995). Cell, 81, 991–1002.
Watson, J. D. & Crick, F. H. C. (1953). Nature (London), 171, 737–738.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.