New Tools and Resources for Analysing Protein Structures and Their Interactions
ᵃBiomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College, Gower Street, London WC1E 6BT, England, ᵇDepartment of Crystallography, Birkbeck College, Malet Street, London WC1E 7HX, England, and ᶜEMBL Outstation, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, England
*Correspondence e-mail: thornton@biochem.ucl.ac.uk
The determination of protein structures has furthered our understanding of how various proteins perform their functions. With the large number of structures currently available in the PDB, it is necessary to be able to study these proteins easily and in detail. Here new software tools are presented which aim to facilitate this analysis. These include the PDBsum WWW site, which provides a summary description of all PDB entries; the programs TOPS and NUCPLOT, which plot schematic diagrams representing protein topology and DNA-binding interactions; SAS, a WWW-based sequence-analysis tool incorporating structural data; and WWW servers for the analysis of protein–protein interfaces and of over 300 haem-binding proteins.
1. Introduction
Here we present a number of new software tools and WWW servers developed for the analysis of protein structures and their interactions with other molecules. These tools were developed in the course of our research, involving computational analysis of many structures in the Protein Data Bank (PDB, Bernstein et al., 1977). Many of the ideas have arisen from studies by crystallographers on individual proteins and their complexes, in which analyses and diagrams are usually performed by hand using ad hoc programs. When faced with hundreds of structures to analyse, it becomes necessary to develop more robust software which is then of use for studying any new structure. The tools we have developed are, for the most part, freely available to the academic community via the WWW (http://www.biochem.ucl.ac.uk/bsm/biocomp/). Additionally they have been used to establish a number of WWW-based resources to provide information on all entries in the PDB. In the descriptions below we use the structure of a protein–DNA complex as an example: namely, PDB entry 1ber which holds the structure of E. coli catabolite gene activator protein (CAP) bound to the DNA molecule 31-2E, determined using X-ray crystallography to a resolution of 2.5 Å (Parkinson et al., 1996).
2. PDBsum
We start with PDBsum (Laskowski et al., 1997) which is a WWW-based database of structural analyses of all entries in the PDB. It makes use of, or provides links to, the majority of the software tools described in this review. Each PDB entry has its own WWW page within PDBsum (http://www.biochem.ucl.ac.uk/bsm/pdbsum) giving an at-a-glance summary of what the entry contains: its protein chains and their secondary structure, any DNA/RNA chains, ligands and water molecules. Attached is a wealth of structural analyses of these molecules as well as extensive links to data in other WWW-based databases. The majority of these analyses are automatically generated soon after a new entry is released by the PDB. The entries can be accessed in a number of ways: by their PDB code, by a simple keyword search, via the Het groups they contain, by E.C. number, or from other databases including our own CATH database (Orengo et al., 1997) and PDB's 3DB database (Stampf et al., 1995).
Figs. 1 and 2 show extracts from the PDBsum page for 1ber. Fig. 1 gives the header information relating to the structure as a whole, including schematic diagrams of the molecules in the file, icons for viewing the coordinates in three dimensions using a VRML browser or RasMol (Sayle & Milner-White, 1995), names of authors, resolution, R factor, and so on. Various links go to other databases including SWISS-PROT (Bairoch & Boeckmann, 1994) and the Nucleic Acid Database (Berman et al., 1992).
Fig. 2 gives a schematic, or `wiring diagram', of the secondary structure and motifs in the A chain of 1ber. The motifs, computed by the PROMOTIF program (Hutchinson & Thornton, 1996), include helices, strands, β-turns, γ-turns, and β-hairpins. Also shown are the domain assignments and which residues are in contact with the DNA. For each domain a link goes to the appropriate structural classification in the CATH database (http://www.biochem.ucl.ac.uk/bsm/cath/).
Where a PDB file contains one or more small-molecule ligands the PDBsum entry includes a LIGPLOT (Wallace et al., 1995) of the ligand interactions with the protein. For protein–DNA complexes, such as 1ber, a NUCPLOT (see below) of the interactions between the protein and DNA is given. Links are provided to TOPS topology cartoons and the SAS sequence search and annotation server, both described below.
3. TOPS
TOPS is an `atlas' of protein topology cartoons at http://tops.ebi.ac.uk/tops, described by Westhead et al. (1998). The TOPS cartoons are schematic diagrams representing the overall topology of a protein chain, or of its constituent domains (Flores et al., 1994). Fig. 3 shows the TOPS diagram for the two domains in chain A of 1ber. The two cartoons show the protein's helices as circles, its strands as triangles and their connectivity along the chain as lines joining these symbols. This provides a simple representation of the relative directions and positions of the secondary-structural elements within each fold. In the case of 1ber, the first domain is an αβ domain incorporating several helices and a jelly roll, while the second has a simpler topology consisting of a single β-sheet with three α-helices.
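The symbol convention described above can be illustrated with a toy text rendering. The following is a minimal sketch and not the TOPS program itself: it assumes a per-residue secondary-structure string in which 'H' marks helix and 'E' marks strand, the function name `topology_string` and the text symbols are hypothetical, and it records only the order of elements along the chain, ignoring their spatial positions and directions.

```python
from itertools import groupby

# Hypothetical symbol table: 'O' stands in for the circle drawn for a helix,
# '^' for the triangle drawn for a strand.
SYMBOLS = {"H": "O", "E": "^"}


def topology_string(ss: str) -> str:
    """Collapse a per-residue secondary-structure string into a chain of symbols.

    `ss` is assumed to use 'H' for helix and 'E' for strand; any other
    character is treated as a connecting loop and is not drawn.
    """
    # groupby merges each run of identical codes into a single element.
    elements = [SYMBOLS[code] for code, _ in groupby(ss) if code in SYMBOLS]
    # The joining dashes stand for the connectivity lines along the chain.
    return "-".join(elements)


print(topology_string("--EEEE--EEEE---HHHHHH--EEE-"))  # ^-^-O-^
```

A real topology cartoon additionally encodes which strands are neighbours in a sheet and which way each element points, which a one-dimensional string cannot capture.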
The TOPS server holds a topology cartoon to represent every structure in the PDB together with a large amount of information about protein topology in general. Additionally, one can submit a set of coordinates and generate a topology cartoon which can be modified using a Java-based editor.
4. NUCPLOT
NUCPLOT (Luscombe et al., 1997) is a program written to aid the analysis of protein–nucleic acid complexes. It generates a schematic diagram showing the protein residues that are involved in binding to DNA and RNA and how they interact with the bases and the sugar–phosphate backbone of the nucleic acid. The resulting diagrams give a clear and simple representation of the important interactions within the complex.
Fig. 4 shows part of a NUCPLOT for 1ber. In this structure, the protein binds as a homodimer to a 30 base-pair site in an approximately symmetrical fashion. Each monomer contains a helix–turn–helix motif which provides both the DNA binding site and the dimer interface. The part of the DNA chain shown in Fig. 4 is a segment of the half site bound to chain A of the protein. The protein residues shown on the plot are those that interact with the DNA either via hydrogen bonds or through van der Waals contacts. From the diagram it can be seen that amino acids are hydrogen bonded to the DNA backbone between bases 3 and 6 on chain C and bases 9 and 11 on chain F.
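The distinction between hydrogen bonds and van der Waals contacts can be sketched as a simple distance test on atom pairs. This is an illustration only: the cut-off values, the restriction to N and O atoms and the function name `classify` are assumptions made for the sketch and are not the criteria used by NUCPLOT, and angular hydrogen-bond geometry is ignored.

```python
import math

# Illustrative cut-offs in angstroms; assumptions for this sketch,
# not the criteria used by NUCPLOT.
HBOND_MAX = 3.5
CONTACT_MAX = 4.0
POLAR = {"N", "O"}


def classify(protein_atom, dna_atom):
    """Return 'hbond', 'contact' or None for one protein-DNA atom pair.

    Each atom is a tuple (element, (x, y, z)) with coordinates in angstroms.
    """
    (elem_p, xyz_p), (elem_d, xyz_d) = protein_atom, dna_atom
    d = math.dist(xyz_p, xyz_d)
    # A close approach between two polar atoms is treated as a hydrogen bond.
    if d <= HBOND_MAX and elem_p in POLAR and elem_d in POLAR:
        return "hbond"
    # Any other pair within the looser cut-off counts as a non-bonded contact.
    if d <= CONTACT_MAX:
        return "contact"
    return None


# Made-up coordinates, for illustration only.
side_chain_n = ("N", (0.0, 0.0, 0.0))
phosphate_o = ("O", (2.9, 0.0, 0.0))
sugar_c = ("C", (0.0, 3.8, 0.0))
print(classify(side_chain_n, phosphate_o))  # hbond
print(classify(side_chain_n, sugar_c))      # contact
```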
The NUCPLOT program is available via ftp from URL http://www.biochem.ucl.ac.uk/~nick/nucplot.html. NUCPLOT diagrams for all protein–DNA complexes in the PDB can be found in the PDBsum database.
5. SAS
In analysing structures and their interactions it is often of value to compare related proteins, especially if the structures of different complexes have been determined (e.g. a series of enzyme–inhibitor complexes). Also in analysing sequences (e.g. from different species) the structural information may be of importance. To facilitate the use of structural information in sequence analysis, we have developed a WWW-based tool called SAS – Sequence Annotated by Structure.
This tool annotates the sequences of known structures with structural information at the residue level, derived by programs developed at UCL. The annotations are represented by colouring individual residues in a sequence, according to selected structural properties such as secondary structure, interatomic contacts and active-site information.
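As a rough illustration of residue-level annotation, the sketch below assigns each residue a colour class from its number of contacts. It is not the SAS implementation: the function name `colour_by_contacts`, the colour names and the threshold of five contacts are all hypothetical choices.

```python
def colour_by_contacts(sequence, contacts):
    """Pair each residue with a colour class based on its contact count.

    `contacts` maps a 0-based sequence position to a number of contacts;
    positions that are absent are taken to make no contacts.
    """
    def colour(n):
        # Hypothetical binning: none, few (1-4) or many (5 or more) contacts.
        if n == 0:
            return "black"
        return "blue" if n < 5 else "red"

    return [(res, colour(contacts.get(i, 0))) for i, res in enumerate(sequence)]


# Made-up three-residue example.
print(colour_by_contacts("RSE", {0: 7, 2: 2}))
# [('R', 'red'), ('S', 'black'), ('E', 'blue')]
```

The same pattern extends to other per-residue properties, such as secondary structure, by swapping the binning function.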
The WWW interface (http://www.biochem.ucl.ac.uk/bsm/sas) has several uses. It can be used to annotate a single sequence from the PDB to view its structural features along the length of the sequence. Alternatively, a multiple sequence alignment can be submitted to show, say, the trends and differences in the structural features of a family of related proteins. And finally, and perhaps most usefully, if a sequence of unknown structure is submitted to SAS the sequences in the PDB are scanned and all related sequences are extracted and annotated by their structural features. This can help in identifying distant homologues by showing whether structurally important residues are present in equivalent positions in the query sequence.
Fig. 5 illustrates the SAS output for a target sequence (SWISS-PROT code P51007) which has as its closest match the sequence of 1ber (sequence identity 24.7%). The structural annotation of the 1ber sequence shows its secondary structure and its residues coloured according to the numbers of contacts they make with the DNA. It can be seen that the predicted secondary structure for the target sequence is in good agreement with the actual secondary structure of 1ber, and the region exhibiting the largest numbers of DNA interactions (bottom line of the alignment) also has a high degree of similarity between the two sequences, strongly suggesting the target sequence is structurally homologous to 1ber.
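A sequence identity such as the one quoted above is computed from a pairwise alignment. The sketch below shows one common convention, counting identical residues over the aligned positions where neither sequence has a gap; the convention actually used by SAS is not stated here, so this choice, like the function name `percent_identity`, is an assumption.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over gap-free aligned positions."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns in which both sequences have a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Made-up aligned fragments: four identical residues in five gap-free columns.
print(percent_identity("ACDE-F", "ACEEGF"))  # 80.0
```

Other conventions divide by the full alignment length or by the length of the shorter sequence, so identities from different tools are not always directly comparable.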
6. Protein–protein interactions
Protein–protein interactions are the basis for many biological functions, and a clear description of an interface is an essential starting point to understanding how the complex is formed and perhaps to guide the design of molecules to inhibit complex formation. A new WWW site (http://www.biochem.ucl.ac.uk/bsm/PP/server) has been created for the analysis of protein–protein interaction sites in multimeric structures. The protein interaction server enables the user to submit the three-dimensional coordinates of a protein complex to obtain a set of physical and chemical parameters that characterize the nature of the protein–protein interface.
Fig. 6 shows the information provided by the server for the interface between chains A and B in the 1ber homodimer. The server calculates data on the size of the protein interface in terms of the lost accessible surface area per chain, the shape (length, breadth and planarity), the intermolecular bonding, polarity, bridging water molecules and packing (Jones & Thornton, 1995). A listing of the residues involved in the protein–protein interface (i.e. whose accessible surface area decreases by >1 Å² on complex formation) is also given, indicating the relative importance of each residue. The parameters for individual structures can be compared to the distributions obtained from data sets of known protein–protein complexes (Jones & Thornton, 1996). Such comparisons allow an estimation of the `normality' of interfaces in new protein complexes, and may be helpful in distinguishing crystal contacts from those of biological relevance.
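The interface-residue criterion quoted above (a decrease in accessible surface area of more than 1 Å² on complex formation) can be written down directly. The sketch assumes that per-residue accessible surface areas for the isolated chain and for the complex have already been computed by a separate program; the function name `interface_residues`, the residue labels and the numerical values are hypothetical.

```python
THRESHOLD = 1.0  # square angstroms, the cut-off quoted in the text


def interface_residues(asa_free, asa_complex, threshold=THRESHOLD):
    """Return {residue: lost ASA} for residues losing more than `threshold`.

    `asa_free` and `asa_complex` map residue labels to accessible surface
    areas (in square angstroms) in the isolated chain and in the complex.
    """
    lost = {}
    for res, free in asa_free.items():
        # A residue missing from the complex table is assumed to be unchanged.
        delta = free - asa_complex.get(res, free)
        if delta > threshold:
            lost[res] = delta
    return lost


# Made-up values, for illustration only.
asa_free = {"Leu10": 50.0, "Gly11": 20.0, "Ser12": 30.5}
asa_complex = {"Leu10": 12.0, "Gly11": 20.0, "Ser12": 30.0}
print(interface_residues(asa_free, asa_complex))  # {'Leu10': 38.0}
```

Sorting the returned residues by lost area gives one simple indication of the relative contribution of each residue to the interface.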
7. Protein–haem interactions
In the PDB there are many examples of protein–haem complexes, which have provided the data for a detailed analysis of how proteins recognize and bind this common biomolecule. This analysis has considered many different aspects including the conformation and relative burial of the haem and the nature of the protein interface.
Haem is an aromatic porphyrin molecule bound to a variety of functionally diverse proteins that have widely differing tertiary structures. In total we analysed 13 non-homologous families, including more than 321 entries in the PDB. The results of this analysis are available on the Internet (http://www.biochem.ucl.ac.uk/bsm/proLig) and as an example Fig. 7 shows the variation in planarity of the haem group in structures representing the 13 different families. The information is useful for comparing the structures of haem groups in newly solved structures.
In future we plan to generalize this approach to analyse any protein–ligand complex to enable a comparative analysis of the intermolecular interactions. Such studies can reveal the differences in binding sites for one molecule bound to a variety of different protein families and facilitate prediction of the geometry of such protein–ligand complexes and rules to guide the design of novel binding proteins.
Acknowledgements
We acknowledge support from the following: NML is supported by a BBSRC special studentship; DM is supported by a BBSRC CASE studentship sponsored by Roche Discovery Welwyn; and MK is supported by a European Union Training and Mobility of Researchers Programme.
References
Bairoch, A. & Boeckmann, B. (1994). Nucleic Acids Res. 22, 3578–3580.
Berman, H. M., Olson, W. K., Beveridge, D. L., Westbrook, J., Gelbin, A., Demeny, T., Hsieh, S-H., Srinivasan, A. R. & Schneider, B. (1992). Biophys. J. 63, 751–759.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F. Jr, Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535–542.
Flores, T. P., Moss, D. S. & Thornton, J. M. (1994). Protein Eng. 7, 31–37.
Hogue, C. W. V., Ohkawa, H. & Bryant, S. H. (1996). Trends Biochem. Sci. 21, 226–229.
Holm, L. & Sander, C. (1994). Nucleic Acids Res. 22, 3600–3609.
Hooft, R. W. W., Vriend, G., Sander, C. & Abola, E. E. (1996). Nature (London), 381, 272.
Hubbard, T. J. P., Murzin, A. G., Brenner, S. E. & Chothia, C. (1997). Nucleic Acids Res. 25, 236–239.
Hutchinson, E. G. & Thornton, J. M. (1996). Protein Sci. 5, 212–220.
Jones, S., Stewart, M., Michie, A. D., Swindells, M. B., Orengo, C. A. & Thornton, J. M. (1998). Protein Sci. 7, 233–242.
Jones, S. & Thornton, J. M. (1995). Prog. Biophys. Mol. Biol. 63, 31–65.
Jones, S. & Thornton, J. M. (1996). Proc. Natl Acad. Sci. USA, 93, 13–20.
King, R. D. & Sternberg, M. J. E. (1996). Protein Sci. 5, 2298–2310.
Kraulis, P. J. (1991). J. Appl. Cryst. 24, 946–950.
Laskowski, R. A., Hutchinson, E. G., Michie, A. D., Wallace, A. C., Jones, M. L. & Thornton, J. M. (1997). Trends Biochem. Sci. 22, 488–490.
Laskowski, R. A., MacArthur, M. W., Moss, D. S. & Thornton, J. M. (1993). J. Appl. Cryst. 26, 283–291.
Luscombe, N. M., Laskowski, R. A. & Thornton, J. M. (1997). Nucleic Acids Res. 25, 4940–4945.
Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). J. Mol. Biol. 247, 536–540.
Orengo, C. A., Michie, A. D., Jones, S., Swindells, M. B., Jones, D. T. & Thornton, J. M. (1997). Structure, 5, 1093–1108.
Parkinson, G., Wilson, C., Gunasekera, A., Ebright, Y. W. & Berman, H. M. (1996). J. Mol. Biol. 260, 395–408.
Pearson, W. R. & Lipman, D. J. (1988). Proc. Natl Acad. Sci. USA, 85, 2444–2448.
Sayle, R. A. & Milner-White, E. J. (1995). Trends Biochem. Sci. 20, 374–376.
Stampf, D. R., Felder, C. E. & Sussman, J. L. (1995). Nature (London), 374, 572–574.
Wallace, A. C., Laskowski, R. A. & Thornton, J. M. (1995). Protein Eng. 8, 127–134.
Westhead, D. R., Hatton, D. C. & Thornton, J. M. (1998). Trends Biochem. Sci. 23, 35–36.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.