Received 27 March 1998
Protein Three-Dimensional Structural Databases: Domains, Structurally Aligned Homologues and Superfamilies
R. Sowdhamini,a+ David F. Burke,a Charlotte Deane,a Jing-fei Huang,a,b Kenji Mizuguchi,a Hampapathulu A. Nagarajaram,a John P. Overington,c N. Srinivasan,a++ Robert E. Stewarda and Tom L. Blundella*
aDepartment of Biochemistry, University of Cambridge, 80 Tennis Court Road, Cambridge CB2 1QW, England, bKunming Institute of Zoology, The Chinese Academy of Sciences, Eastern Jiaochang Road, Kunming, Yunnan 650223, People's Republic of China, and cPfizer Central Research, Sandwich, Kent CT13 9NJ, England
This paper reports the availability of a database of protein structural domains (DDBASE), an alignment database of homologous proteins (HOMSTRAD) and a database of structurally aligned superfamilies (CAMPASS) on the World Wide Web (WWW). DDBASE contains information on the organization of structural domains and their boundaries; it includes only one representative domain from each of the homologous families. This database has been derived by identifying the presence of structural domains in proteins on the basis of inter-secondary structural distances using the program DIAL [Sowdhamini & Blundell (1995), Protein Sci. 4, 506-520]. The alignment of proteins in superfamilies has been performed on the basis of the structural features and relationships of individual residues using the program COMPARER [Sali & Blundell (1990), J. Mol. Biol. 212, 403-428]. The alignment databases contain information on the conserved structural features in homologous proteins and those belonging to superfamilies. Available data include the sequence alignments in structure-annotated formats and the provision for viewing superposed structures of proteins using a graphical interface. Such information, which is freely accessible on the WWW, should be of value to crystallographers in the comparison of newly determined protein structures with previously identified protein domains or existing families.
The Brookhaven Protein Data Bank (PDB) (Bernstein et al., 1977) currently contains over 7000 entries; after removing the repeated entries of identical proteins (such as the same protein in different complexes or at different resolutions), there remain 1729 proteins (Brenner et al., 1997), including many homologues (see Fig. 1). If only representative structures from the homologous protein `family' are retained such that no two proteins have more than 25% sequence identity (Hobohm et al., 1992; May 1997 release), the resultant data set still includes 687 proteins. This corresponds to 463 superfamilies of protein domains with 96 superfamilies arising from more than one family (Brenner et al., 1997).
Figure 1
A cartoon representation of the classification and alignment of proteins at various structural hierarchies. The HOMSTRAD database contains alignments of homologous sequences. Some of them exist as multi-domain proteins (denoted by different coloured spheres). DDBASE is a compilation of structural domains found in representatives of homologous proteins. CAMPASS is a database of aligned protein domains belonging to superfamilies.
Proteins that have diverged but retain high sequence identity fold into similar three-dimensional structures and usually perform similar functions; these clearly belong to a homologous family (Richardson, 1981; Rossmann & Argos, 1977; Chothia, 1984; Overington et al., 1990, 1993). Proteins or domains of proteins that adopt the same three-dimensional fold despite poor sequence identity and perform remotely similar functions (Blundell & Humbel, 1980; Murzin & Chothia, 1992; Murzin et al., 1995; Murzin, 1996) are grouped into superfamilies. The identification of new members belonging to pre-existing families and superfamilies is straightforward only when contiguous residues forming a functional motif are conserved, in which case PROSITE searches may be appropriate (Bairoch, 1991). Furthermore, these should be distinguished from proteins with no sequence identity and no similarity of functions that nevertheless have the same fold or superfolds (Orengo et al., 1994).
An analysis of protein sequence and structure entries indicates that about 50% of the `new' sequences could be attributed a previously known function and roughly 20% of the sequences have homologues of known structure (Bork et al., 1992, 1994; Koonin et al., 1994). When the crystal structure of a `new' protein is determined, it is important to compare its structure with the previously determined structures. This is facilitated by the existence of databases of aligned protein structures and sequences (Overington et al., 1990, 1993; Johnson et al., 1993).
Often homology or structural similarity exists between parts of two different proteins; one or two domains only may be conserved (Wetlaufer, 1973; Richardson, 1981; Wodak & Janin, 1981; Go, 1981). Although algorithms to identify such compact sub-structures have been developed (Schulz, 1977; Crippen, 1978; Rose, 1979; Zehfus & Rose, 1986), it is convenient to use automatic methods so that the information of domain organization can be compiled for the large number of protein structures now available (Islam et al., 1995; Siddiqui & Barton, 1995; Swindells, 1995; Nichols et al., 1995). We have constructed a database of protein structural domains (DDBASE) (Sowdhamini et al., 1996) using the procedure DIAL (Sowdhamini & Blundell, 1995).
Structure-based alignment of sequences of related protein domains provides a basis for understanding evolutionary relationships as well as diversity in function and specificity. Such alignments can be used to derive information on amino-acid replacements, which is also of value in comparative modelling and fold recognition (Overington et al., 1990). Databases of structural alignments of homologous proteins (HOMSTRAD: HOMologous STRucture Alignment Database) (Overington et al., 1990, 1993; Mizuguchi et al., 1998) and protein superfamilies (CAMPASS: CAMbridge database of Protein Alignments organized as Structural Superfamilies) (Sowdhamini et al., 1998) are described in this paper. Because of the low percentage of sequence identities amongst distantly related proteins, it is difficult, on the basis of sequence alone, to obtain reliable alignments where secondary structures and functionally important residues are aligned correctly. Alignment of proteins in superfamilies, therefore, is based on the conservation of structural features and relationships using the program COMPARER (Sali & Blundell, 1990; Zhu et al., 1992). The three databases described here are available on the WWW (http://www-cryst.bioc.cam.ac.uk/~ddbase for DDBASE, http://www-cryst.bioc.cam.ac.uk/~homstrad for HOMSTRAD and http://www-cryst.bioc.cam.ac.uk/~campass for CAMPASS).
DDBASE is a compilation of the information on structural domains that are present in a representative set of 436 protein chains (Sowdhamini et al., 1996). The identification of structural domains in a protein chain was performed using the program DIAL (Sowdhamini & Blundell, 1995), where elements of secondary structure are clustered on the basis of their proximity to each other. This gave rise to 695 structural domains, of which 206 are α-rich, 191 are β-rich and 294 fall under the α-and-β class. Of these domains, 63% are from multi-domain proteins and 73% have fewer than 150 residues.
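The idea of clustering secondary-structure elements by proximity can be illustrated with a minimal sketch. It is not the DIAL procedure itself: the distance measure and clustering details used by DIAL are given in Sowdhamini & Blundell (1995). Here each element is reduced to a single hypothetical centroid, and generic average-linkage agglomerative clustering is assumed; the function name and the toy coordinates are illustrative only.

```python
from itertools import combinations
from math import dist


def average_linkage(centroids):
    """Agglomerative average-linkage clustering of labelled 3-D points.

    centroids: dict mapping a secondary-structure label to an (x, y, z) tuple.
    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram displays.
    """
    clusters = [frozenset([name]) for name in centroids]
    merges = []

    def linkage(a, b):
        # Mean distance over all cross-cluster pairs of elements.
        pairs = [(p, q) for p in a for q in b]
        return sum(dist(centroids[p], centroids[q]) for p, q in pairs) / len(pairs)

    while len(clusters) > 1:
        # Merge the two clusters that are currently closest.
        a, b = min(combinations(clusters, 2), key=lambda ab: linkage(*ab))
        merges.append((sorted(a), sorted(b), round(linkage(a, b), 2)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


# Hypothetical centroids for two helices (H1, H2) and two strands (E1, E2).
toy = {
    "H1": (0.0, 0.0, 0.0),
    "H2": (5.0, 0.0, 0.0),
    "E1": (30.0, 0.0, 0.0),
    "E2": (34.0, 0.0, 0.0),
}
history = average_linkage(toy)
for step in history:
    print(step)
```

In this toy case the two strands and the two helices each merge early at short distances, and the two resulting clusters join only at a much larger distance, which is the kind of separation that suggests two putative domains.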
The organization of structural domains in individual protein chains is described on the WWW page assigned to that protein chain; an example is shown in Fig. 2. Secondary-structural dendrograms are provided that correspond to the clustering based on distances between all possible pairs of secondary structures. All possible combinations of nodes in the secondary-structural dendrogram are automatically examined for compactness of putative domains corresponding to clusters and listed with their disjoint-factor values (see Sowdhamini & Blundell, 1995, for details). It is possible for the user to extract the domain boundary corresponding to any situation by clicking on that entry. However, the `best' domain boundaries, defined by the program, have been identified and the domain organization may be viewed on graphics using RasMol (Sayle & Milner-White, 1995). Each domain can be identified by its unique six-character code (the first four characters correspond to the PDB code of the protein, the fifth to the chain identifier and the sixth, as a subscript, corresponds to the domain numbering as in the individual domain pages).
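The six-character domain code described above can be split into its parts with a few lines of code. The sketch below assumes the code is written as a plain six-character string with a single-digit domain number; the names `DomainCode` and `parse_domain_code` are hypothetical and are not part of DDBASE.

```python
from typing import NamedTuple


class DomainCode(NamedTuple):
    pdb_code: str  # first four characters: PDB entry code
    chain: str     # fifth character: chain identifier
    domain: int    # sixth character: domain number within the chain


def parse_domain_code(code: str) -> DomainCode:
    """Split a six-character domain code into PDB code, chain and domain number."""
    if len(code) != 6:
        raise ValueError(f"expected six characters, got {len(code)}: {code!r}")
    if not code[5].isdigit():
        raise ValueError(f"domain number must be a digit: {code!r}")
    return DomainCode(code[:4], code[4], int(code[5]))


# Domain 1 of the B chain of abrin (PDB code 1abr); the domain number is illustrative.
parsed = parse_domain_code("1abrB1")
print(parsed)
```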
Figure 2
Domain database (DDBASE) WWW page for the B chain of abrin (PDB code, 1abr) as an example. Domains have been identified using the program DIAL (Sowdhamini & Blundell, 1995). The organization of structural domains can be viewed as secondary structural dendrograms where helices and extended strands have been clustered on the basis of intersecondary structural inter-Cα distances. Various combinations of nodes, corresponding to secondary-structural clusters, have been examined for structural compactness and listed along with their disjoint factor (see Sowdhamini & Blundell, 1995, for details). Domain boundaries for all these possibilities can be accessed by clicking on that entry. Further, detailed outputs can be accessed for the `best' combination. The `best' combination is usually the one with the highest disjoint factor (Df) without any secondary structures being ignored (-Nst. column shows the number of secondary structures that are ignored while examining various nodes in the dendrogram). The protein chain can be viewed using RasMol (Sayle & Milner-White, 1995) where domains are coloured differently in the case of multi-domain proteins.
DDBASE can be used to trace similarities where particular domains are shared between proteins. It is especially useful where there are discontinuous domains. 400 large domains (those with seven or more secondary structures) can be grouped into 30 classes on the basis of the structural similarity estimated from structural environments of individual secondary structures (Rufino & Blundell, 1994; Sowdhamini et al., 1996). The clustering of individual protein domains into structurally similar classes can also be examined on the DDBASE WWW page.
HOMSTRAD and CAMPASS are databases of structure-based alignments of protein sequences, grouped into homologous families and superfamilies, respectively. Aligned sequences of families of homologous protein structures are available in HOMSTRAD (Overington et al., 1990, 1993) and categorized according to the secondary-structural classes. There are 130 homologous protein families with at least two members in the March 1998 version. The sequences of homologous proteins within a family are initially aligned using the rigid-body superposition program MNYFIT (Sutcliffe et al., 1987) or COMPARER (Sali & Blundell, 1990; Zhu et al., 1992) and later subjected to a careful manual examination. Similar types of information are available for CAMPASS, the database of proteins (or domains) belonging to superfamilies (Sowdhamini et al., 1998). Superfamilies of structural domains were selected initially on the basis of structural environment at secondary structural units (Rufino & Blundell, 1994; Sowdhamini et al., 1996). The selection of superfamilies has been extended by referring to SCOP (Murzin et al., 1995) and by including smaller domains like the cystine-knots, not considered earlier in the clustering analysis since they were not easy to compare using automatic structure-based procedures. 367 of 451 superfamilies annotated in SCOP have single families (Brenner et al., 1997; the more recent February 1998 release of SCOP has 419 of the 571 superfamilies with single families). Superfamily members were chosen such that no two domains within a superfamily share more than 25% sequence identity (alignments of closely related proteins are available in HOMSTRAD). This cut-off is consistent with the DDBASE definition in choosing representative protein chains.
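The 25% cut-off can be illustrated with a simple greedy filter. This is only a sketch under stated assumptions: the text does not say how members were chosen, so the greedy order-dependent strategy, the function name `select_representatives` and the identity values below are all hypothetical.

```python
def select_representatives(names, identity, cutoff=25.0):
    """Greedy filter: keep a domain only if it shares no more than `cutoff`
    percent sequence identity with every domain already kept.

    names:    candidate domains, in order of preference.
    identity: dict mapping frozenset({a, b}) to percent identity of a and b.
    """
    kept = []
    for name in names:
        if all(identity[frozenset((name, k))] <= cutoff for k in kept):
            kept.append(name)
    return kept


# Hypothetical pairwise identities (percent) for three domains.
identity = {
    frozenset(("domA", "domB")): 62.0,
    frozenset(("domA", "domC")): 12.0,
    frozenset(("domB", "domC")): 14.0,
}
reps = select_representatives(["domA", "domB", "domC"], identity)
print(reps)
```

Because the filter is greedy, the result depends on the order of the candidates; listing domB first would retain domB and domC instead. Close homologues removed at this step are the ones whose alignments belong in a family-level database such as HOMSTRAD.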
A rigorous sequence-alignment program, COMPARER (Sali & Blundell, 1990; Zhu et al., 1992), was used to align the members of a superfamily on the basis of structural features and relationships, which are equivalenced using simulated annealing. Table 1 lists protein superfamilies, with at least two members within the above-defined cut-off of sequence identity, whose alignments have been compiled in the March 1998 version. This includes 67 multi-member superfamilies, which involve 293 domains representing 464 homologous proteins. There are a further 357 superfamilies, annotated in SCOP, which have single members (Murzin et al., 1995; Brenner et al., 1997). A few other multi-member superfamilies included in SCOP, such as the DNA-binding HMG box, pheromones, annexins and insulin-superfamily, were excluded from CAMPASS as members exhibited more than 25% sequence identity.
Footnote to Table 1: + This family is yet to be added to the homologous alignment database.
The WWW site of HOMSTRAD (Mizuguchi et al., 1998) provides a page for each of the families. The name of the protein, source, resolution and R factor are given for each family member corresponding to a PDB entry. The alignment of sequences is formatted in JOY (Overington et al., 1990) which highlights the conservation of local-residue structural features such as secondary structure, solvent accessibility and hydrogen bonding. Fig. 3 shows the alignment of cytochrome c from different sources and its homologues (cytochrome c2 and cytochrome c550), as an example.
Figure 3
HOMSTRAD database. Structure-based alignment of proteins in the family of cytochrome c. The first four characters of the code of the protein correspond to the PDB code. Numbers in brackets correspond to residue numbers and residues are shown in single letter code. The alignment has been formatted using JOY (Overington et al., 1990). The conserved helices are important to the structural integrity of the proteins; functionally important residues (for example CXXCH, residue number 13 of 1ycc) are conserved. Residues are classified into two categories: those which are in the interior and those which are solvent-exposed, with solvent accessibility (ASA) values of more than 7% (Hubbard & Blundell, 1987). In the sequence alignment, the solvent-exposed and solvent-buried residues are shown in lower case and upper case, respectively. Residues which have a positive φ value and a cis-peptide bond in their backbone conformation are shown in italics and with a breve accent on top, respectively. Disulfide-bonded cystine residues are shown by a cedilla symbol. Hydrogen bonds to other side chains, main-chain amides and main-chain carbonyl groups are shown by a tilde (indicated in non-HTML files), in bold and underlined, respectively. Residues in β-strands, α-helices and 3(10)-helices are shown in blue, red and maroon, respectively.
CAMPASS, on the WWW, provides information on the superfamilies: for each superfamily member, the name, source, resolution and domain boundaries are given. The beginning and end residue numbers for each segment of discontinuous domains are recorded. The pairwise percentage identity matrix of the members is provided. The structure-based alignment in the JOY-annotated form (Overington et al., 1990), similar to that described in HOMSTRAD, is shown and is also available for extraction as PostScript, LaTeX, HTML or plain text files. Fig. 4 shows the alignment of the cytochrome superfamily as an example. A single representative (1ycc) of the nine cytochrome homologues (see above and Fig. 3) has been aligned with rather distantly related cytochromes such as cytochrome c6 and c551. The structures of the proteins within a family/superfamily have been superposed using MNYFIT (Sutcliffe et al., 1987), where the equivalent residues correspond to the final alignment. These superposed structures can be viewed on the WWW using the RasMol graphics interface (Sayle & Milner-White, 1995).
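A pairwise percentage identity matrix of the kind mentioned above can be derived from an alignment as sketched below. The exact definition used by the databases is not stated in the text; this sketch assumes identity is counted over columns in which neither sequence has a gap. The sequences and the function names are hypothetical.

```python
from itertools import combinations


def percent_identity(a, b, gap="-"):
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def identity_matrix(alignment):
    """Symmetric matrix (dict of dicts) of pairwise percent identities."""
    names = list(alignment)
    matrix = {n: {n: 100.0} for n in names}
    for p, q in combinations(names, 2):
        matrix[p][q] = matrix[q][p] = percent_identity(alignment[p], alignment[q])
    return matrix


# Hypothetical three-member alignment; '-' marks a gap.
toy_alignment = {
    "seq1": "GDVEKGKK-IF",
    "seq2": "GDAEKGKKLIF",
    "seq3": "-SAEAGEK-VF",
}
matrix = identity_matrix(toy_alignment)
for name, row in matrix.items():
    print(name, {k: round(v, 1) for k, v in row.items()})
```

Other conventions, such as dividing by the length of the shorter sequence, give slightly different values, so the denominator should be stated whenever identities are compared against a threshold.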
Figure 4
CAMPASS database. Structure-based alignment of the cytochrome superfamily including distantly related proteins such as c550. Helix 2 of 1ycc, conserved within the homologues (see Fig. 3), occurs as an insertion in this alignment. Despite poor sequence identity, the functionally important residues (CXXCH) are conserved amongst the members in this superfamily.
Fig. 5 shows the distribution of pairwise percentage identities in the two alignment databases. Protein pairs in HOMSTRAD have a broad range of pairwise sequence identities with a slightly bimodal distribution (237 pairs have sequence identities between 25 and 30% and 121 pairs have sequence identities between 60 and 65% out of a total of 1962 pairs). However, the majority of homologous proteins in the database have sequence identities between 15 and 65%. The distribution of pairwise sequence identity of members within superfamilies (CAMPASS) is restricted to a maximum of 25%. A vast majority of protein pairs (449 out of 665) have pairwise percentage identities between 5 and 15%.
Figure 5
Distribution of pairwise percentage sequence identities amongst members in the homologue alignment database (HOMSTRAD) and superfamily alignment database (CAMPASS).
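A distribution of this kind can be tabulated by counting pairwise identities in fixed-width bins. The sketch below assumes 5% bins, matching the ranges quoted in the text; the bin width used for the published figure is not stated, and the sample values are hypothetical.

```python
from collections import Counter


def bin_identities(values, width=5):
    """Count percent identities per bin, keyed by the lower edge of each bin.

    A value of exactly 100 is placed in the top bin rather than a bin of its own.
    """
    counts = Counter()
    for v in values:
        if not 0 <= v <= 100:
            raise ValueError(f"percent identity out of range: {v}")
        lower = min(int(v // width) * width, 100 - width)
        counts[lower] += 1
    return dict(sorted(counts.items()))


# Hypothetical pairwise identities (percent).
sample = [7.5, 12.0, 14.9, 27.0, 62.5, 100.0]
hist = bin_identities(sample)
print(hist)
```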
HOMSTRAD and CAMPASS are distinct from but complementary to other databases. SCOP (Murzin et al., 1995) has classified the entire Protein Data Bank at different levels of structural hierarchy and structural domains are defined. There is emphasis on functionality in the clustering of folds. SCOP does not attempt to perform or present sequence or structural alignments. CATH (Orengo et al., 1993, 1994) was originally designed and developed for whole proteins, and the authors took particular care to exclude multi-domain proteins. Subsequently, the structures have been systematically classified at the level of domains (Orengo et al., 1997). CATH does not include structure-based alignments of sequences. FSSP (Holm & Sander, 1994) is most similar to HOMSTRAD and CAMPASS because it also provides structure-based sequence alignments, even incorporating remote homologues. However, the alignments do not distinguish homologues and superfamilies from those which only share a similar fold. The databases described in this paper contain structure-based alignments that have been specially annotated to describe the structural environment at residue positions. This should provide extra information useful in the comparison of protein structures.
Bairoch, A. (1991). Nucleic Acids Res. 19, 2013-2018.
Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T. & Tasumi, M. (1977). J. Mol. Biol. 112, 535-542.
Blundell, T. L. & Humbel, R. E. (1980). Nature (London), 287, 781-787.
Bork, P., Ouzounis, C. & Sander, C. (1994). Curr. Opin. Struct. Biol. 4, 393-403.
Bork, P., Ouzounis, C., Sander, C., Scharf, M., Schneider, R. & Sonnhammer, E. (1992). Nature (London), 358, 287.
Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1997). Curr. Opin. Struct. Biol. 7, 369-376.
Chothia, C. (1984). Ann. Rev. Biochem. 53, 537-572.
Crippen, G. M. (1978). J. Mol. Biol. 126, 315-332.
Go, M. (1981). Nature (London), 291, 90-92.
Hobohm, U., Scharf, M., Schneider, R. & Sander, C. (1992). Protein Sci. 1, 409-417.
Holm, L. & Sander, C. (1994). Nucleic Acids Res. 22, 3600-3609.
Hubbard, T. J. P. & Blundell, T. L. (1987). Protein Eng. 1, 159-171.
Islam, S. A., Luo, J. & Sternberg, M. J. E. (1995). Protein Eng. 8, 513-525.
Johnson, M. S., Overington, J. P. & Blundell, T. L. (1993). J. Mol. Biol. 231, 735-752.
Koonin, E. V., Bork, P. & Sander, C. (1994). EMBO J. 13, 493-503.
Mizuguchi, K., Deane, C., Overington, J. P. & Blundell, T. L. (1998). Protein Sci. In the press.
Murzin, A. G. (1996). Curr. Opin. Struct. Biol. 6, 386-394.
Murzin, A. G. & Chothia, C. (1992). Curr. Opin. Struct. Biol. 2, 895-903.
Murzin, A. G., Brenner, S. E., Hubbard, T. & Chothia, C. (1995). J. Mol. Biol. 247, 536-540.
Nichols, W. L., Rose, G. D., Ten Eyck, L. F. & Zimm, B. H. (1995). Proteins, 23, 38-48.
Orengo, C. A., Flores, T. P., Taylor, W. R. & Thornton, J. M. (1993). Protein Eng. 6, 485-500.
Orengo, C. A., Jones, D. T. & Thornton, J. M. (1994). Nature (London), 372, 631-634.
Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). Structure, 5, 1093-1108.
Overington, J. P., Johnson, M. S., Sali, A. & Blundell, T. L. (1990). Proc. R. Soc. London Ser. B, 241, 132-145.
Overington, J. P., Zhu, Z.-Y., Sali, A., Johnson, M. S., Sowdhamini, R., Louie, G. V. & Blundell, T. L. (1993). Biochem. Soc. Trans. 21, 597-604.
Richardson, J. S. (1981). Adv. Protein Chem. 34, 167-339.
Rose, G. D. (1979). J. Mol. Biol. 134, 447-470.
Rossmann, M. G. & Argos, P. (1977). J. Mol. Biol. 109, 99-129.
Rufino, S. D. & Blundell, T. L. (1994). J. Comput. Aided Mol. Des. 8, 5-27.
Sali, A. & Blundell, T. L. (1990). J. Mol. Biol. 212, 403-428.
Sayle, R. A. & Milner-White, E. J. (1995). Trends Biochem. Sci. 20, 374-376.
Schulz, G. E. (1977). Angew. Chem. Intl Ed. 16, 23-33.
Siddiqui, A. S. & Barton, G. J. (1995). Protein Sci. 4, 872-884.
Sowdhamini, R. & Blundell, T. L. (1995). Protein Sci. 4, 506-520.
Sowdhamini, R., Burke, D. F., Huang, J.-F., Mizuguchi, K., Nagarajaram, H. A., Srinivasan, N., Steward, R. E. & Blundell, T. L. (1998). Structure. In the press.
Sowdhamini, R., Rufino, S. D. & Blundell, T. L. (1996). Folding Design, 1, 209-220.
Sutcliffe, M. J., Haneef, I., Carney, D. & Blundell, T. L. (1987). Protein Eng. 1, 377-384.
Swindells, M. B. (1995). Protein Sci. 4, 103-112.
Wetlaufer, D. B. (1973). Proc. Natl Acad. Sci. USA, 70, 697-701.
Wodak, S. J. & Janin, J. (1981). Biochemistry, 20, 6544-6553.
Zehfus, M. H. & Rose, G. D. (1986). Biochemistry, 25, 5759-5765.
Zhu, Z.-Y., Sali, A. & Blundell, T. L. (1992). Protein Eng. 5, 43-51.