Acta Cryst. (1998). D54, 1078–1084

Protein Data Bank (PDB): Database of Three-Dimensional Structural Information of Biological Macromolecules

The Protein Data Bank (PDB) at Brookhaven National Laboratory is a database containing experimentally determined three-dimensional structures of proteins, nucleic acids and other biological macromolecules, with approximately 8000 entries. Data are easily submitted via PDB's WWW-based tool AutoDep, in either mmCIF or PDB format, and are most conveniently examined via PDB's WWW-based tool 3DB Browser.


Introduction
The Protein Data Bank (PDB) at Brookhaven National Laboratory (BNL) is a database containing experimentally determined three-dimensional structures of proteins, nucleic acids and other biological macromolecules (Abola et al., 1987, 1997; Bernstein et al., 1977). The PDB has a 26-year history of service to a global community of researchers, educators and students in a wide variety of scientific disciplines. The archives contain atomic coordinates, citations, primary and secondary structure information, crystallographic structure experimental data, as well as hyperlinks to many other scientific databases. Scientists around the world contribute structures to the PDB and use it on a daily basis. The common interest shared by this community is a need to access information that can relate the biological functions of macromolecules to their three-dimensional structures.
The PDB has introduced substantial enhancements to data deposition and management, and user access in the past four years. The PDB browser, first introduced on PC and UNIX systems and later via the World Wide Web (WWW), allows researchers to search and retrieve information from the PDB faster and far more flexibly than the older printed indices. The 3DB Browser (Sussman, 1997) has been upgraded and enhanced to meet the increasing needs of its user community. In parallel, PDB's new AutoDep facility allows researchers to deposit their data quickly and accurately over the WWW directly to the PDB, at either the European Bioinformatics Institute (EBI), or at BNL. Data are then processed by the PDB staff at Brookhaven.
The PDB faces the constant challenge of keeping abreast of the ever-increasing amount of data it must store and provide to an ever-widening and diversified user community, while maintaining the highest standards of data integrity and reliability, and facilitating data retrieval, knowledge exploration and hypothesis testing. Over the next few years the PDB will be transformed from a simple data repository as at present into a more powerful highly sophisticated knowledge-based system for archiving and accessing structural information that combines the advantages of object-oriented and relational database systems. So as not to interrupt current services, these changes have been introduced gradually, insulating users from drastic changes, and thus have provided both a high degree of compatibility with existing software and a consistent user interface for casual browsers. Collaborative centers have been, and continue to be, established worldwide to assist in data deposition, archiving and distribution.
2. Background and significance of the resource

2.1. The early years: 1971–1988
The PDB was established in 1971 by Dr Walter Hamilton, at the suggestion of members of the American Crystallographic Association (ACA) and participants at the 1971 Cold Spring Harbor Symposium; see, e.g., D. C. Phillips' remarks on how protein crystallography was Coming of Age (Phillips, 1971). From the beginning, the PDB has operated with the continued support of the crystallographic community. The PDB has always been a truly international effort, initially with affiliated centers at Cambridge, UK; Melbourne, Australia; and Osaka, Japan. (These centers have subsequently been augmented by a number of on-line data providers, 42 at present; see the latest PDB Newsletter for a complete list.) Data acquisition and dissemination, via tape media, was on a global scale from the outset, with a small staff that handled ~25 structural depositions per year.
Introduction of the current PDB format in 1972 ensured that these data were readily accessible in a convenient and standard form, not only to crystallographers but also to biologists and chemists. This data format has evolved over the last 20 years into the de facto standard, serving as both input and output for literally hundreds of computer programs. It has proven to be quite flexible, and recently has been extended for applications that were not imaginable when it was first designed. For example, we have recently inserted HyperText links into PDB file headers, dynamically linking them to other databases throughout the world, via the WWW (see URL http://www.pdb.bnl.gov/).
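To illustrate why a fixed-column text format is so easy for programs to consume, the following is a minimal Python sketch of reading one coordinate record. The column positions follow the published PDB format description for ATOM/HETATM records, which is not reproduced in this article; the `Atom` class, the function name `parse_atom_line` and the sample coordinates are illustrative assumptions, not part of any PDB software.

```python
from dataclasses import dataclass


@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float


def parse_atom_line(line: str) -> Atom:
    """Parse one fixed-column ATOM/HETATM record.

    Slices are 0-based; e.g. line[30:38] is columns 31-38 (the x coordinate).
    """
    if line[0:6].strip() not in ("ATOM", "HETATM"):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain_id=line[21].strip(),
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


# Illustrative record (made-up coordinates), assembled field by field so that
# every value lands in the column range the parser expects.
SAMPLE = (
    "ATOM  "                      # columns 1-6: record name
    + f"{1:5d}"                   # columns 7-11: atom serial number
    + " "
    + f"{' N':<4}"                # columns 13-16: atom name
    + " "                         # column 17: alternate location
    + "ALA"                       # columns 18-20: residue name
    + " "
    + "A"                         # column 22: chain identifier
    + f"{1:4d}"                   # columns 23-26: residue sequence number
    + "    "                      # column 27 (insertion code) + 3 blanks
    + f"{11.104:8.3f}{6.134:8.3f}{-6.504:8.3f}"  # columns 31-54: x, y, z
)

atom = parse_atom_line(SAMPLE)
print(atom.res_name, atom.chain_id, atom.res_seq, atom.x, atom.y, atom.z)
```

Because every field sits at a fixed position, no tokenizer or schema is needed, which is one plausible reason the format became a common interchange point between otherwise unrelated programs.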

The data explosion: 1989–1992
Rapid developments in the preparation of crystals of macromolecules and in experimental techniques for structure analysis and refinement have led to a revolution in structural biology. These factors have contributed significantly to an enormous increase in the number of laboratories performing structural studies of macromolecules to atomic resolution and the number of such studies per laboratory. Advances include: (1) recombinant DNA techniques that permit almost any protein or nucleic acid to be produced in large amounts; (2) rapid protein and DNA (gene) sequencing techniques that have made protein sequencing routine; (3) better X-ray detectors; (4) real-time interactive computer graphics systems, together with more automated methods for structure determination and refinement; (5) synchrotron radiation, allowing the use of extremely tiny crystals, multiple-wavelength anomalous dispersion (MAD) phasing, and time-resolved studies via Laue techniques; (6) NMR methods permitting structure determination of macromolecules in solution; and (7) electron microscopy (EM) techniques, for obtaining high-resolution structures.
These dramatic advances produced an abrupt transition from the linear growth of 15–25 new structures deposited per year in the PDB before 1987 to a rapid exponential growth reaching the current rate of about 50 submissions per week (see Fig. 1).
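The scale of this change can be made concrete with a short calculation. The sketch below uses only the two rates quoted in the text; the 52-week year and the variable names are assumptions made for the illustration.

```python
WEEKS_PER_YEAR = 52  # assumption: submissions arrive evenly through the year

early_rate_per_year = (15, 25)  # depositions per year before 1987 (from the text)
current_rate_per_week = 50      # submissions per week (from the text)

current_rate_per_year = current_rate_per_week * WEEKS_PER_YEAR

# Ratio of the current annual rate to the upper and lower early rates.
fold_low = current_rate_per_year / early_rate_per_year[1]
fold_high = current_rate_per_year / early_rate_per_year[0]

print(current_rate_per_year)              # 2600
print(round(fold_low), round(fold_high))  # 104 173
```

On these assumptions, the deposition rate has grown by roughly two orders of magnitude.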
In the same period, the proliferation and increasing power of computers, the introduction of relatively inexpensive interactive graphics, and growth of computer networks greatly increased the demand for access to PDB data in many diverse ways. The requirements of molecular biologists, rational drug designers, and others in academia and industry are often fundamentally different from those of crystallographers and computational chemists who had been the major PDB users since the 1970s.

Contents and access to the PDB archives
The archives contain atomic coordinates, bibliographic citations, primary and secondary structure information, as well as crystallographic structure factors and NMR experimental data. Annotations in the structure entries include amino-acid or nucleotide sequences (with notes of any conflicts between the structure in the PDB and sequence databases), source organism from which the biological material was derived, references to papers, secondary structure, complexes with small molecules included within the structure, etc. Third-party annotations include images and movies of structures; pointers to other databases which contain information on the structural class or family of the particular structure; pointers to particular specialized databases (maintained by others) such as the Protein Kinase Resource (http://www.sdsc.edu/Kinases/pk_home.html), esther (http://www.ensam.inra.fr/cholinesterase/) or the Archive of Obsolete PDB Entries (http://pdbobs.sdsc.edu/PDBObs.cgi); and those that provide additional experimental information, such as the BioMagResBank (BMRB) NMR structural database.
Table 1 is a summary of the contents of PDB. Present plans are to keep abreast of the deposition rate within a timeline of three months or less from receipt of an entry to final archiving. This includes the time spent in careful checking by the PDB professional staff as well as a period for the depositor to double check the processed entry.
PDB entries are available on CD-ROM, which PC users can search using the PDB-SHELL browser. UNIX users can also search the CD if they download a copy of the browser software. The entries are also available over the Internet from Brookhaven and 14 mirror sites worldwide, listed in Table 2. They can be searched and retrieved via the Internet browser (Peitsch et al., 1995; Stampf et al., 1995) and now the 3DB Browser (Sussman, 1997), which is accessed through WWW browsers such as Netscape, Explorer etc., as illustrated in Fig. 2. All these search methods provide direct access to the molecular viewing program RasMol (Sayle & Milner-White, 1995).
The 3DB Browser has a number of features that make it easy to access information found in PDB entries. Users can search according to any combination of such fields as compound name, experiment title, authors (depositors), biological source, journal references, date of deposition, and nature of small molecules (heterogens) complexed with the structure. Boolean operators allow highly complex search strings. Entries selected can be retrieved automatically, and the molecular structures can be displayed using the public domain molecular viewer RasMol (Sayle & Milner-White, 1995), Netscape's Chemscape Chime plug-in, or a similar viewer. The entries also include HyperText links to the SwissProt protein sequence database (Bairoch & Boeckmann, 1994), the BioMagResBank (BMRB) NMR structural database (Seavey et al., 1991), the Enzyme Commission Database (Bairoch, 1994), PubMed access to the Medline database, and several other databases (see Table 3 for a list of linked external data sources). Internet access to the archives has become the primary mode of retrieving entries from the PDB. However, PDB continues to receive a considerable number of orders for our CD-ROM product. PDB anticipates that this will continue to be true for a variety of reasons. For example, network performance still remains poor in a number of locations, and these disks, released quarterly, provide local access to the contents of the archive. Alternatively, all files in the PDB can be stored locally and updated automatically on a daily basis by use of mirroring software distributed by the PDB.
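The idea of combining field searches with Boolean operators can be sketched in a few lines of Python. This is not the 3DB Browser's implementation or query syntax, neither of which is described here; the helper names (`field_contains`, `all_of`, `any_of`, `negate`, `search`) are hypothetical, and the records `XXX1` and `XXX2` are invented placeholders. Only the 1ACJ record uses details given in this article.

```python
from typing import Callable, Dict, List

Entry = Dict[str, object]
Predicate = Callable[[Entry], bool]


def field_contains(field: str, text: str) -> Predicate:
    """Match entries whose field (a string or list of strings) contains text."""
    needle = text.lower()

    def pred(entry: Entry) -> bool:
        value = entry.get(field, "")
        values = value if isinstance(value, list) else [value]
        return any(needle in str(v).lower() for v in values)

    return pred


def all_of(*preds: Predicate) -> Predicate:
    """Boolean AND of several predicates."""
    return lambda entry: all(p(entry) for p in preds)


def any_of(*preds: Predicate) -> Predicate:
    """Boolean OR of several predicates."""
    return lambda entry: any(p(entry) for p in preds)


def negate(pred: Predicate) -> Predicate:
    """Boolean NOT of a predicate."""
    return lambda entry: not pred(entry)


def search(entries: List[Entry], pred: Predicate) -> List[str]:
    """Return the ID codes of all entries matching the predicate."""
    return [str(e["id"]) for e in entries if pred(e)]


ENTRIES: List[Entry] = [
    {"id": "1ACJ", "compound": "acetylcholinesterase", "authors": ["Harel"]},
    # The two records below are hypothetical placeholders, not real entries.
    {"id": "XXX1", "compound": "hypothetical protease", "authors": ["Doe"]},
    {"id": "XXX2", "compound": "hypothetical acetylcholinesterase mutant",
     "authors": ["Roe"]},
]

# compound contains "acetylcholinesterase" AND NOT author "Roe"
query = all_of(
    field_contains("compound", "acetylcholinesterase"),
    negate(field_contains("authors", "Roe")),
)
print(search(ENTRIES, query))  # ['1ACJ']
```

Representing each field test as a small predicate keeps arbitrary nesting of AND, OR and NOT straightforward, which is the property the text attributes to the browser's search strings.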

Data deposition
Since its inception in 1971, the method followed by the PDB for entering and distributing information has  (Fig. 3).
AutoDep then calls a suite of validation programs, whose output is returned via the WWW to the depositor within minutes of sending the data to the PDB. Based on these checks, authors may decide to give permission to release the entry immediately; to release it after a hold of up to one year; or to go back and reexamine the structure in light of the output diagnostics before completing the submission procedure. The PDB ID code is issued only after the author gives release approval. The submitted data must include all mandatory information as described in the October 1997 PDB Newsletter (http://www.pdb.bnl.gov/pdb-docs/newsletter.html) and in the List of Items Mandatory for a Complete PDB Submission (http://www.pdb.bnl.gov/pdb-docs/mandatory_items.html). The data must also pass certain validation criteria as described in the January 1998 PDB Newsletter, and in the document Validation for Layered Release (http://www.pdb.bnl.gov/pdb-docs/validation.html). Entries passing the validation criteria are released clearly identified as LAYER-1. An associated file containing output diagnostics is also released.
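The decision points in this paragraph can be summarized as a small sketch. It is a reading of the text, not AutoDep's actual logic: the names `AuthorChoice`, `Decision` and `decide` are hypothetical, the one-year hold is approximated as 365 days, and it is assumed here that a submission lacking mandatory items or failing validation receives no ID code and no release.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AuthorChoice(Enum):
    RELEASE_NOW = "release immediately"
    HOLD = "release after a hold"
    REEXAMINE = "re-examine the structure"


MAX_HOLD_DAYS = 365  # assumption: "maximum one year hold" taken as 365 days


@dataclass
class Decision:
    id_issued: bool                    # PDB ID code issued?
    layer: Optional[str]               # release layer, if any
    release_delay_days: Optional[int]  # days until release, if any


def decide(has_mandatory_items: bool, passes_validation: bool,
           choice: AuthorChoice, hold_days: int = 0) -> Decision:
    """Sketch of the outcome of one AutoDep submission, as read from the text."""
    # No release approval, or requirements unmet: no ID code, no release
    # (the second condition is an assumption of this sketch).
    if choice is AuthorChoice.REEXAMINE or not (has_mandatory_items
                                                and passes_validation):
        return Decision(False, None, None)
    if choice is AuthorChoice.HOLD:
        if not 0 < hold_days <= MAX_HOLD_DAYS:
            raise ValueError("hold must be between 1 day and one year")
        return Decision(True, "LAYER-1", hold_days)
    return Decision(True, "LAYER-1", 0)


print(decide(True, True, AuthorChoice.RELEASE_NOW))
print(decide(True, True, AuthorChoice.HOLD, hold_days=180))
print(decide(True, True, AuthorChoice.REEXAMINE))
```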
Following this, PDB staff process the entry as was performed previously. The entry and the output of the validation suite are then evaluated by a PDB scientific staff member, who completes the annotations and returns the entry to the author for comment and approval. Table 4 summarizes checks included in our current data-validation suite. Corrections from the author are incorporated into the entry, which is reanalyzed and validated before being archived and released. Most of this work covers issues not now fully delegated to automatic software. The resulting entry, after author approval, will be equivalent to the traditional PDB entry and will be designated LAYER-2. We strongly believe that such thorough checking and annotation is essential for ensuring the long-term value of the data.
Originally, data flow was a manual system, designed for a staff of one to two scientists, and a deposition rate of about 25–50 entries per year. One person processed an entry from submission through its release. By the late 1980s, when the first steps at automation were being introduced, running the validation programs took about 4 h per entry. Today, the same step, which is highly automated and includes a vastly improved set of validation programs, takes about 15 min. Graphical viewing of data, a useful and powerful annotating and checking tool, has been available to processors since 1992.

Fig. 2. … (Peitsch et al., 1995; Stampf et al., 1995; Sussman, 1997). On the left is the browse screen, with windows to enter search strings. In the upper right, the selected entry, i.e. acetylcholinesterase (1ACJ) (Harel et al., 1993), is displayed as a ribbon diagram with RasMol (Sayle & Milner-White, 1995). On the lower right, the text of the 1ACJ entry is shown, with the blue text indicating a HyperLink to other databases, including the SwissProt protein sequence database (Bairoch & Boeckmann, 1994).
Ideally, PDB would like the entire deposition process to be automatic. However, certain kinds of problems continue to require manual intervention and processing. The most troublesome areas remain those involving handling of heterogens (small molecules complexed with the structure), resolving crystal packing issues, representing molecules with non-crystallographic symmetry, and resolving conflicts between the submitted amino-acid sequence and that found in the sequence databases. Publications and other references are sometimes consulted to verify factual information such as crystal data, biological details, reference information, etc. Processing programs, although much improved over those used in 1991, still allow errors to pass undetected through the system, requiring a visual check of all entries. We are striving to expand the AutoDep suite of

Funding
The PDB is supported by a combination of Federal Government Agency funds and user fees. Support is provided by the US National Science Foundation, the US Public Health Service, National Institutes of Health, National Center for Research Resources, National Institute of General Medical Sciences, National Library of Medicine, and the US Department of Energy, together with user fees.

Examples of impact of the PDB
There are numerous examples in molecular biology, medicine and drug discovery where the PDB is playing an increasingly important role. Possibly the best example of structural information being used to help design new drugs to combat disease is in the area of HIV infection. At present there are already seven HIV proteins whose three-dimensional structures have been determined (see Fig. 4). These have aided in the design of several drugs, each of which targets one of these proteins.
Converting the PDB to 3DB involves changes in every aspect of current operations. The new system relies on a relational database system for data management and archiving using the Object-Protocol Model (OPM) tools (http://gizmo.lbl.gov/opm.html) (Chen & Markowitz, 1995). This development effort attempts to address the needs of the diverse user community served by the PDB. The system is being designed with the expectation that it will be federated with other biological databases. Our hope is that this system will allow complex queries to be submitted to the 3DB, parts of which may need to be sent automatically to other databases for processing, and return a composite answer. In addition to providing users with a powerful environment for complex ad-hoc queries, 3DB-Base will also facilitate management of the growing archive, which is expected to contain over 30 000 structural reports by the year 2000. It will fully support the new IUCr archival format mmCIF for deposition and queries. This work is being performed as a collaboration among the following groups: The Protein Data Bank, Brookhaven National Laboratory; European Bioinformatics Institute (EBI); Cambridge Crystallographic Data Centre (CCDC); Bioinformatics Unit, Weizmann Institute of Science; BioMagResBank, University of Wisconsin (BMRB); OPM Data Management Tools Project, Lawrence Berkeley National Laboratory; and Gene Logic Inc., Berkeley, CA.

Related databases
See Table 5 for key WWW sites related to three-dimensional structures of biological macromolecules.