Protein Three-Dimensional Structural Databases: Domains, Structurally Aligned Homologues and Superfamilies
aDepartment of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1QW, England, bKunming Institute of Zoology, The Chinese Academy of Sciences, Eastern Jiaochang Road, Kunming, Yunnan 650223, People's Republic of China, and cPfizer Central Research, Sandwich, Kent CT13 9NJ, England
*Correspondence e-mail: firstname.lastname@example.org
This paper reports the availability of a database of protein structural domains (DDBASE), an alignment database of homologous proteins (HOMSTRAD) and a database of structurally aligned superfamilies (CAMPASS) on the World Wide Web (WWW). DDBASE contains information on the organization of structural domains and their boundaries; it includes only one representative domain from each of the homologous families. This database has been derived by identifying the presence of structural domains in proteins on the basis of inter-secondary structural distances using the program DIAL [Sowdhamini & Blundell (1995), Protein Sci. 4, 506–520]. The alignment of proteins in superfamilies has been performed on the basis of the structural features and relationships of individual residues using the program COMPARER [Sali & Blundell (1990), J. Mol. Biol. 212, 403–428]. The alignment databases contain information on the conserved structural features in homologous proteins and those belonging to superfamilies. Available data include the sequence alignments in structure-annotated formats and the provision for viewing superposed structures of proteins using a graphical interface. Such information, which is freely accessible on the WWW, should be of value to crystallographers in the comparison of newly determined protein structures with previously identified protein domains or existing families.
The Brookhaven Protein Data Bank (PDB) (Bernstein et al., 1977) currently contains over 7000 entries; after removing the repeated entries of identical proteins (such as the same protein in different complexes or at different resolutions), there remain 1729 proteins (Brenner et al., 1997), including many homologues (see Fig. 1). If only representative structures from the homologous protein 'family' are retained such that no two proteins have more than 25% sequence identity (Hobohm et al., 1992; May 1997 release), the resultant data set still includes 687 proteins. This corresponds to 463 superfamilies of protein domains with 96 superfamilies arising from more than one family (Brenner et al., 1997).
Proteins that have diverged but retain high sequence identity fold into similar three-dimensional structures and usually perform similar functions; these clearly belong to a homologous family (Richardson, 1981; Rossmann & Argos, 1977; Chothia, 1984; Overington et al., 1990, 1993). Proteins or domains of proteins that adopt the same three-dimensional fold despite poor sequence identity and perform remotely similar functions (Blundell & Humbel, 1980; Murzin & Chothia, 1992; Murzin et al., 1995; Murzin, 1996) are said to belong to the same superfamily. The identification of new members of existing families and superfamilies is straightforward only when contiguous residues forming a functional motif are conserved, in which case PROSITE searches may be appropriate (Bairoch, 1991). Furthermore, members of superfamilies should be distinguished from proteins with no sequence identity and no similarity of function that nevertheless share the same fold or superfold (Orengo et al., 1994).
An analysis of protein sequence and structure entries indicates that about 50% of 'new' sequences could be assigned a previously known function and roughly 20% of the sequences have homologues of known structure (Bork et al., 1992, 1994; Koonin et al., 1994). When the crystal structure of a 'new' protein is determined, it is important to compare it with previously determined structures. This is facilitated by the existence of databases of aligned protein structures and sequences (Overington et al., 1990, 1993; Johnson et al., 1993).
Homology or structural similarity often exists between parts of two different proteins; only one or two domains may be conserved (Wetlaufer, 1973; Richardson, 1981; Wodak & Janin, 1981; Go, 1981). Although algorithms to identify such compact sub-structures have been developed (Schulz, 1977; Crippen, 1978; Rose, 1979; Zehfus & Rose, 1986), it is convenient to use automatic methods so that information on domain organization can be compiled for the large number of protein structures now available (Islam et al., 1995; Siddiqui & Barton, 1995; Swindells, 1995; Nichols et al., 1995). We have constructed a database of protein structural domains (DDBASE) (Sowdhamini et al., 1996) using the procedure DIAL (Sowdhamini & Blundell, 1995).
Structure-based alignment of sequences of related protein domains provides a basis for understanding evolutionary relationships as well as diversity in function and specificity. Such alignments can be used to derive information on amino-acid replacements, which is also of value in comparative modelling and fold recognition (Overington et al., 1990). Databases of structural alignments of homologous proteins (HOMSTRAD: HOMologous STRucture Alignment Database) (Overington et al., 1990, 1993; Mizuguchi et al., 1998) and protein superfamilies (CAMPASS: CAMbridge database of Protein Alignments organized as Structural Superfamilies) (Sowdhamini et al., 1998) are described in this paper. Because of the low sequence identity amongst distantly related proteins, it is difficult, on the basis of sequence alone, to obtain reliable alignments in which secondary structures and functionally important residues are aligned correctly. Alignment of proteins in superfamilies is therefore based on the conservation of structural features and relationships using the program COMPARER (Sali & Blundell, 1990; Zhu et al., 1992). The three databases described here are available on the WWW (http://www-cryst.bioc.cam.ac.uk/~ddbase for DDBASE, http://www-cryst.bioc.cam.ac.uk/~homstrad for HOMSTRAD and http://www-cryst.bioc.cam.ac.uk/~campass for CAMPASS).
DDBASE is a compilation of information on the structural domains present in a representative set of 436 protein chains (Sowdhamini et al., 1996). Structural domains in a protein chain were identified using the program DIAL (Sowdhamini & Blundell, 1995), in which elements of secondary structure are clustered on the basis of their proximity to one another. This gave rise to 695 structural domains, of which 206 are α-rich, 191 are β-rich and 294 fall under the α-and-β class. Of the identified domains, 63% are from multi-domain proteins and 73% have fewer than 150 residues.
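The idea of clustering secondary-structure elements by proximity can be illustrated with a minimal sketch. This is not the DIAL algorithm: it assumes that each element has already been reduced to a single centroid coordinate, it uses plain single-linkage agglomerative clustering, and the function name `cluster_elements`, the element labels and the distance cut-off are all hypothetical choices made for illustration only.

```python
from itertools import combinations
from math import dist


def cluster_elements(centroids, cutoff):
    """Single-linkage clustering of secondary-structure centroids.

    centroids: dict mapping an element label to an (x, y, z) tuple.
    cutoff: hypothetical distance threshold, in the units of the coordinates.
    Returns a sorted list of sorted label lists, one list per cluster.
    """
    # Start with every element in its own cluster.
    clusters = [{label} for label in centroids]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(clusters, 2):
            # Single linkage: the closest pair of members decides the merge.
            nearest = min(dist(centroids[p], centroids[q]) for p in a for q in b)
            if nearest <= cutoff:
                clusters.remove(a)
                clusters.remove(b)
                clusters.append(a | b)
                merged = True
                break  # restart the scan on the updated cluster list
    return sorted(sorted(c) for c in clusters)


# Illustrative coordinates only; they are not taken from any PDB entry.
example = {"H1": (0, 0, 0), "H2": (5, 0, 0), "E1": (30, 0, 0), "E2": (34, 0, 0)}
domains = cluster_elements(example, cutoff=10.0)
```

With these made-up coordinates the two helices group together and the two strands group together. A real domain-assignment procedure would also need to assess the compactness of each putative domain, which this sketch does not attempt.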
The organization of structural domains in individual protein chains is described on the WWW page assigned to that protein chain; an example is shown in Fig. 2. Secondary-structural dendrograms are provided that correspond to the clustering based on distances between all possible pairs of secondary structures. All possible combinations of nodes in the secondary-structural dendrogram are automatically examined for compactness of the putative domains corresponding to clusters and are listed with their disjoint-factor values (see Sowdhamini & Blundell, 1995, for details). The user can extract the domain boundaries corresponding to any listed combination by clicking on that entry. In addition, the 'best' domain boundaries, as defined by the program, have been identified and the domain organization may be viewed graphically using RasMol (Sayle & Milner-White, 1995). Each domain can be identified by its unique six-character code (the first four characters correspond to the PDB code of the protein, the fifth to the chain identifier and the sixth, as a subscript, to the domain numbering used in the individual domain pages).
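The six-character code described above can be split into its three parts with a few lines of code. The sketch below assumes that the chain identifier is a single character and that the domain number is a single digit; the text does not say how chains without an identifier are written, so that case is not handled. The names `DomainId` and `parse_domain_code` and the example code `1abcA2` are hypothetical.

```python
from typing import NamedTuple


class DomainId(NamedTuple):
    pdb_code: str  # first four characters: PDB entry code
    chain: str     # fifth character: chain identifier
    domain: int    # sixth character: domain number within the chain


def parse_domain_code(code: str) -> DomainId:
    """Split a six-character domain code into PDB code, chain and domain number."""
    if len(code) != 6:
        raise ValueError(f"expected six characters, got {len(code)}: {code!r}")
    pdb_code, chain, number = code[:4], code[4], code[5]
    if not number.isdigit():
        raise ValueError(f"domain number must be a digit: {code!r}")
    return DomainId(pdb_code, chain, int(number))


# '1abcA2' is a made-up code used only to show the layout.
parsed = parse_domain_code("1abcA2")
```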
DDBASE can be used to trace similarities where particular domains are shared between proteins. It is especially useful where there are discontinuous domains. The 400 large domains (those with seven or more secondary structures) can be grouped into 30 classes on the basis of structural similarity estimated from the structural environments of individual secondary structures (Rufino & Blundell, 1994; Sowdhamini et al., 1996). The clustering of individual protein domains into structurally similar classes can also be examined on the DDBASE WWW page.
HOMSTRAD and CAMPASS are databases of structure-based alignments of protein sequences, grouped into homologous families and superfamilies, respectively. Aligned sequences of families of homologous protein structures are available in HOMSTRAD (Overington et al., 1990, 1993) and are categorized according to secondary-structural class. There are 130 homologous protein families with at least two members in the March 1998 version. The sequences of homologous proteins within a family are initially aligned using the rigid-body superposition program MNYFIT (Sutcliffe et al., 1987) or COMPARER (Sali & Blundell, 1990; Zhu et al., 1992) and are later subjected to careful manual examination. Similar types of information are available for CAMPASS, the database of proteins and protein domains belonging to superfamilies (Sowdhamini et al., 1998). Superfamilies of structural domains were selected initially on the basis of the structural environment of secondary-structural units (Rufino & Blundell, 1994; Sowdhamini et al., 1996). The selection of superfamilies has been extended by referring to SCOP (Murzin et al., 1995) and by including smaller domains such as the cystine-knots, which were not considered earlier in the clustering analysis because they were not easy to compare using automatic structure-based procedures. Of the 451 superfamilies annotated in SCOP, 367 have single families (Brenner et al., 1997; in the more recent February 1998 release of SCOP, 419 of the 571 superfamilies have single families). Superfamily members were chosen such that no two domains within a superfamily share more than 25% sequence identity (alignments of closely related proteins are available in HOMSTRAD). This cut-off is consistent with the DDBASE definition used in choosing representative protein chains.
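The 25% cut-off can be illustrated with a simple greedy filter. This is a sketch under stated assumptions and is not necessarily the selection procedure used for CAMPASS or DDBASE: it assumes that pairwise percentage identities are already available, it keeps domains in the order in which they are supplied, and the function name `select_representatives` and the member names and identity values are invented for the example.

```python
def select_representatives(names, identity, cutoff=25.0):
    """Greedily keep domains so that no kept pair exceeds `cutoff` % identity.

    names: domain names in order of preference (earlier names are kept first).
    identity: dict mapping frozenset({name_a, name_b}) to percentage identity.
    """
    kept = []
    for name in names:
        # Keep a domain only if it is within the cut-off of every domain kept so far.
        if all(identity[frozenset((name, other))] <= cutoff for other in kept):
            kept.append(name)
    return kept


# Made-up identities for three hypothetical superfamily members.
pairwise = {
    frozenset(("A", "B")): 40.0,
    frozenset(("A", "C")): 12.0,
    frozenset(("B", "C")): 18.0,
}
representatives = select_representatives(["A", "B", "C"], pairwise)
```

Because the filter is greedy, the result depends on the input order: here B is dropped because it is 40% identical to A, which was considered first.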
A rigorous sequence-alignment program, COMPARER (Sali & Blundell, 1990; Zhu et al., 1992), was used to align the members of a superfamily on the basis of structural features and relationships, which are equivalenced using simulated annealing. Table 1 lists the protein superfamilies, with at least two members within the above-defined cut-off of sequence identity, whose alignments have been compiled in the March 1998 version. These comprise 67 multi-member superfamilies involving 293 domains that represent 464 homologous proteins. There are a further 357 superfamilies, annotated in SCOP, which have single members (Murzin et al., 1995; Brenner et al., 1997). A few other multi-member superfamilies included in SCOP, such as the DNA-binding HMG box, pheromones, annexins and the insulin superfamily, were excluded from CAMPASS because their members exhibited more than 25% sequence identity.
‡This family is yet to be added in the homologous alignment database.
The WWW site of HOMSTRAD (Mizuguchi et al., 1998) provides a page for each of the families. The name of the protein, source, resolution and R factor are given for each family member corresponding to a PDB entry. The alignment of sequences is formatted in JOY (Overington et al., 1990) which highlights the conservation of local-residue structural features such as secondary structure, solvent accessibility and hydrogen bonding. Fig. 3 shows the alignment of cytochrome c from different sources and its homologues (cytochrome c2 and cytochrome c550), as an example.
CAMPASS, on the WWW, provides information on the superfamilies: for each superfamily member, the name, source, resolution and domain boundaries are given. The beginning and end residue numbers for each segment of discontinuous domains are recorded. The pairwise percentage identity matrix of the members is provided. The structure-based alignment in the JOY-annotated form (Overington et al., 1990), similar to that described for HOMSTRAD, is shown and is also available for extraction as PostScript, LaTeX, HTML or plain-text files. Fig. 4 shows the alignment of the cytochrome superfamily as an example. A single representative (1ycc) of the nine cytochrome homologues (see above and Fig. 3) has been aligned with rather distantly related cytochromes such as cytochrome c6 and c551. The structures of the proteins within a family or superfamily have been superposed using MNYFIT (Sutcliffe et al., 1987), where the equivalent residues correspond to the final alignment. These superposed structures can be viewed on the WWW using the RasMol graphics interface (Sayle & Milner-White, 1995).
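A pairwise percentage identity matrix of the kind mentioned above can be computed directly from an alignment. The sketch below makes an assumption that the text does not state: identity is taken as the number of identical residues divided by the number of positions at which both sequences have a residue (positions with a gap in either sequence are ignored). Other definitions are in common use, and the function names and toy sequences are hypothetical.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percentage of identical residues over positions aligned in both sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap.
    aligned = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not aligned:
        return 0.0
    matches = sum(a == b for a, b in aligned)
    return 100.0 * matches / len(aligned)


def identity_matrix(alignment):
    """Return {(name_a, name_b): percent identity} for every ordered pair."""
    names = list(alignment)
    return {
        (m, n): percent_identity(alignment[m], alignment[n])
        for m in names
        for n in names
    }


# Toy alignment; the sequences are invented and carry no biological meaning.
toy = {"p1": "AC-DE", "p2": "ACFDQ", "p3": "GC-DE"}
matrix = identity_matrix(toy)
```

In the toy alignment, `p1` and `p2` share three identical residues over four mutually aligned positions, so `matrix[("p1", "p2")]` is 75.0.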
Fig. 5 shows the distribution of pairwise percentage identities in the two alignment databases. Protein pairs in HOMSTRAD have a broad range of pairwise sequence identities with a slightly bimodal distribution (of a total of 1962 pairs, 237 have sequence identities between 25 and 30% and 121 have sequence identities between 60 and 65%). However, the majority of homologous protein pairs in the database have sequence identities between 15 and 65%. The pairwise sequence identity of members within superfamilies (CAMPASS) is restricted to a maximum of 25%. Most of these protein pairs (449 out of 665) have pairwise percentage identities between 5 and 15%.
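A distribution like the one summarized above can be tabulated by counting pairwise identities in fixed-width bins. The sketch assumes 5% bins that include their lower edge and exclude their upper edge, with a value of exactly 100% placed in the last bin; the ranges quoted in the text suggest 5% bins, but the exact binning used for Fig. 5 is not stated. The function name and the sample values are hypothetical.

```python
from collections import Counter


def bin_identities(values, width=5):
    """Count percentage identities in bins [lower, lower + width)."""
    counts = Counter()
    for value in values:
        # Clamp so that exactly 100% falls into the final bin.
        lower = min(int(value // width) * width, 100 - width)
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Invented identities, not values taken from HOMSTRAD or CAMPASS.
histogram = bin_identities([7.5, 12.0, 14.9, 27.0, 100.0])
```

Each key of the returned dictionary is the lower edge of a bin, so the key 25 counts pairs with identities from 25% up to, but not including, 30%.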
HOMSTRAD and CAMPASS are distinct from, but complementary to, other databases. SCOP (Murzin et al., 1995) classifies the entire Protein Data Bank at different levels of structural hierarchy and defines structural domains. It places emphasis on functionality in the clustering of folds. SCOP does not attempt to perform or present sequence or structural alignments. CATH (Orengo et al., 1993, 1994) was originally designed and developed for whole proteins, and the authors took particular care to exclude multi-domain proteins. Subsequently, the structures have been systematically classified at the level of domains (Orengo et al., 1997). CATH does not include structure-based alignments of sequences. FSSP (Holm & Sander, 1994) is the most similar to HOMSTRAD and CAMPASS because it also provides structure-based sequence alignments, even incorporating remote homologues. However, its alignments do not distinguish homologues and superfamily members from proteins that share only a similar fold. The databases described in this paper contain structure-based alignments that have been specially annotated to describe the structural environment at residue positions. This should provide extra information useful in the comparison of protein structures.
Bairoch, A. (1991). Nucleic Acids Res. 19, 2013–2018.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535–542.
Blundell, T. L. & Humbel, R. E. (1980). Nature (London), 287, 781–787.
Bork, P., Ouzounis, C. & Sander, C. (1994). Curr. Opin. Struct. Biol. 4, 393–403.
Bork, P., Ouzounis, C., Sander, C., Scharf, M., Schneider, R. & Sonnhammer, E. (1992). Nature (London), 358, 287.
Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1997). Curr. Opin. Struct. Biol. 7, 369–376.
Chothia, C. (1984). Ann. Rev. Biochem. 53, 537–572.
Crippen, G. M. (1978). J. Mol. Biol. 126, 315–332.
Go, M. (1981). Nature (London), 291, 90–92.
Hobohm, U., Scharf, M., Schneider, R. & Sander, C. (1992). Protein Sci. 1, 409–417.
Holm, L. & Sander, C. (1994). Nucleic Acids Res. 22, 3600–3609.
Hubbard, T. J. P. & Blundell, T. L. (1987). Protein Eng. 1, 159–171.
Islam, S. A., Luo, J. & Sternberg, M. J. E. (1995). Protein Eng. 8, 513–525.
Johnson, M. S., Overington, J. P. & Blundell, T. L. (1993). J. Mol. Biol. 231, 735–752.
Koonin, E. V., Bork, P. & Sander, C. (1994). EMBO J. 13, 493–503.
Mizuguchi, K., Deane, C., Overington, J. P. & Blundell, T. L. (1998). Protein Sci. In the press.
Murzin, A. G. (1996). Curr. Opin. Struct. Biol. 6, 386–394.
Murzin, A. G. & Chothia, C. (1992). Curr. Opin. Struct. Biol. 2, 895–903.
Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). J. Mol. Biol. 247, 536–540.
Nichols, W. L., Rose, G. D., Ten Eyck, L. F. & Zimm, B. H. (1995). Proteins, 23, 38–48.
Orengo, C. A., Flores, T. P., Taylor, W. R. & Thornton, J. M. (1993). Protein Eng. 6, 485–500.
Orengo, C. A., Jones, D. T. & Thornton, J. M. (1994). Nature (London), 372, 631–634.
Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). Structure, 5, 1093–1108.
Overington, J. P., Johnson, M. S., Sali, A. & Blundell, T. L. (1990). Proc. R. Soc. London Ser. B, 241, 132–145.
Overington, J. P., Zhu, Z.-Y., Sali, A., Johnson, M. S., Sowdhamini, R., Louie, G. V. & Blundell, T. L. (1993). Biochem. Soc. Trans. 21, 597–604.
Richardson, J. S. (1981). Adv. Protein Chem. 34, 167–339.
Rose, G. D. (1979). J. Mol. Biol. 134, 447–470.
Rossmann, M. G. & Argos, P. (1977). J. Mol. Biol. 109, 99–129.
Rufino, S. D. & Blundell, T. L. (1994). J. Comput. Aided Mol. Des. 8, 5–27.
Sali, A. & Blundell, T. L. (1990). J. Mol. Biol. 212, 403–428.
Sayle, R. A. & Milner-White, E. J. (1995). Trends Biochem. Sci. 20, 374–376.
Schulz, G. E. (1977). Angew. Chem. Intl Ed. 16, 23–33.
Siddiqui, A. S. & Barton, G. J. (1995). Protein Sci. 4, 872–884.
Sowdhamini, R. & Blundell, T. L. (1995). Protein Sci. 4, 506–520.
Sowdhamini, R., Burke, D. F., Huang, J.-F., Mizuguchi, K., Nagarajaram, H. J., Srinivasan, N., Steward, R. E. & Blundell, T. L. (1998). Structure. In the press.
Sowdhamini, R., Rufino, S. D. & Blundell, T. L. (1996). Folding Design, 1, 209–220.
Sutcliffe, M. J., Haneef, I., Carney, D. & Blundell, T. L. (1987). Protein Eng. 1, 377–384.
Swindells, M. B. (1995). Protein Sci. 4, 103–112.
Wetlaufer, D. B. (1973). Proc. Natl Acad. Sci. USA, 70, 697–701.
Wodak, S. J. & Janin, J. (1981). Biochemistry, 20, 6544–6553.
Zehfus, M. H. & Rose, G. D. (1986). Biochemistry, 25, 5759–5765.
Zhu, Z.-Y., Sali, A. & Blundell, T. L. (1992). Protein Eng. 5, 43–51.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.