research papers
SCOP, Structural Classification of Proteins Database: Applications to Evaluation of the Effectiveness of Sequence Alignment Methods and Statistics of Protein Structural Data
(a) Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SA, England, (b) Centre for Protein Engineering, MRC Centre, Hills Road, Cambridge CB2 2QH, England, (c) Department of Structural Biology, Stanford University, Stanford, CA 94305-5400, USA, and (d) Laboratory of Molecular Biology, MRC Centre, Hills Road, Cambridge CB2 2QH, England
*Correspondence e-mail: th@sanger.ac.uk
The Structural Classification of Proteins (SCOP) database provides a detailed and comprehensive description of the relationships of all known protein structures. The classification is on hierarchical levels: the first two levels, family and superfamily, describe near and far evolutionary relationships; the third, fold, describes geometrical relationships. The distinction between evolutionary relationships and those that arise from the physics and chemistry of proteins is a feature that is unique to this database, so far. The database can be used as a source of data to calibrate sequence search algorithms and for the generation of population statistics on protein structures. The database and its associated files are freely accessible from a number of WWW sites mirrored from URL http://scop.mrc-lmb.cam.ac.uk/scop/.
1. Introduction
At present (April 1998) the Brookhaven Protein Data Bank (PDB, Abola et al., 1987) contains 7435 entries and the number is increasing by about 200 a month. These proteins have structural similarities with other proteins and, in many cases, share a common evolutionary origin. To facilitate access to this information, we have constructed the Structural Classification of Proteins (SCOP) database (Murzin et al., 1995). It includes not only all proteins in the current version of the PDB, but also many proteins for which there are published descriptions but whose coordinates are not yet available.
The classification of proteins in SCOP has been constructed by visual inspection and comparison of structures. Given the current limitations of purely automatic procedures, we believe this approach produces the most accurate and useful results. The unit of classification is usually the protein domain. Small proteins, and most of those of medium size, have a single domain and are, therefore, treated as a whole. The domains in large proteins are usually classified individually.
The classification of the proteins is on hierarchical levels.
1.1. Family
Proteins are clustered together into families on the basis of one of two criteria that imply their having a common evolutionary origin: first, all proteins that have significant sequence similarity; second, proteins with lower sequence identities but whose functions and structures are very similar; for example, globins with sequence identities of 15%.
1.2. Superfamily
Families whose proteins have low sequence identities, but whose structures and, in many cases, functional features suggest that a common evolutionary origin is probable, are placed together in superfamilies; for example, the variable and constant domains of immunoglobulins.
1.3. Common fold
Superfamilies and families are defined as having a common fold if their proteins have the same major secondary structures in the same arrangement and with the same topological connections (for recent reviews see Orengo, 1994; Murzin, 1994). The structural similarities of proteins in the same fold category probably arise from the physics and chemistry of proteins favouring certain packing arrangements and chain topologies.
1.4. Class
The different folds have been grouped into classes. Most of the folds are assigned to one of five structural classes:
(i) All-α, those whose structure is essentially formed by α-helices;
(ii) all-β, those whose structure is essentially formed by β-sheets;
(iii) α/β, those with α-helices and β-strands;
(iv) α+β, those in which α-helices and β-strands are largely segregated; and
(v) multi-domain, those with domains of different fold and for which no homologues are known at present.
Further classes cover entries such as small proteins and theoretical models (see below). These hierarchical levels are illustrated in Fig. 1.
There are now a number of other databases which classify protein structures, such as CATH (Orengo et al., 1993, 1997), FSSP (Holm & Sander, 1994, 1996), Entrez (Hogue et al., 1996) and DDBASE (Sowdhamini et al., 1996); however, the distinction between evolutionary relationships and those that arise from the physics and chemistry of proteins is a feature that is unique to SCOP, so far. Because functional similarity is implied by an evolutionary relationship but not necessarily by a physical relationship, we believe that this classification level is of considerable value, for example as a way of linking very distant sequence families reliably.
2. Steps used to classify proteins in SCOP
The following description outlines the major steps in the classification of protein structures at the different levels listed above.
Computational methods are used to aid the classification process; however, the information they provide is incomplete, so final decisions in all cases are the result of manual inspection. For example, sequence comparison is used to detect relationships automatically between parts of new structures and domains already classified, but it fails to identify many of the structural relationships in SCOP, either because the sequence relationship has become too weak (for evolutionarily related proteins) or because it never existed (for evolutionarily unrelated proteins with similar folds). Structure–structure comparison programs can identify domains of similar structure, but manual inspection is required to verify the choice of fold, as several similar but distinct folds are frequently identified. The assignment of proteins of known structure to evolutionarily related superfamilies is perhaps the single most powerful and important feature of the database, but it is the one most reliant on the manual procedures described below, as current computational methods are almost entirely unhelpful in this regard.
2.1. Domain and class
The first step in the classification of a protein is to divide it, where necessary, into domains. The basic idea of a domain is a region of a protein which has its own hydrophobic core and has relatively little interaction with the rest, so that it is essentially structurally independent. Identification of domains is not trivial and can frequently be performed correctly only by using evolutionary information to see, for example, how domains have been `shuffled' in different proteins. Typically domains are collinear in sequence, but occasionally one domain will have another `inserted' into it, or two homologous domains will intertwine by swapping some topologically equivalent parts of their chains.
Where domains can be identified (which in many cases will be the entire protein chain) these are placed in classes based on whether their cores consist exclusively of α-helices, β-sheets, or some mixture. In some borderline cases a domain could be argued to fit equally well in more than one class, so for this reason class should be regarded as mainly for convenience of browsing and not always an unambiguous definition.
Because of the problem of identifying domains on the basis of a single protein structure there is a multi-domain class. Proteins here have multiple domains which have never been seen independently of each other, so accurate determination of their boundaries is not possible and perhaps not meaningful or useful. This is seen as a transient class, as proteins found here will be classified elsewhere as soon as evidence for their domain boundaries emerges.
There are also classes for proteins and domains which are not globular, soluble structures stabilized by the packing of α-helices and β-sheets. These are `small proteins', for those proteins whose structure is stabilized by disulfide bridges or by metal ligands rather than by a hydrophobic core; `membrane proteins'; `short `theoretical models'; and `non-proteins', for entries in the Protein Data Bank such as nucleic acids.
2.2. Folds
Structure–structure comparison programs such as DALI, available via a World Wide Web (WWW) server (Holm & Sander, 1995), allow similarities to be identified in many cases; however, interpreting the results is not always straightforward. There are now many proteins with similar but distinct folds, and topological similarity may not be sufficient. The approach used for SCOP to characterize a fold is to look first at the major architectural features, and then to identify more subtle characteristics. Where folds appear similar, but protein structures do not superimpose well, these proteins cannot be classified as having the same fold or superfamily. Topological similarities of this kind are on an intermediate level between class and fold and, in the current version of SCOP, they are indicated implicitly by listing folds with similar topologies together on the class page. This approach is also used to segregate different architectural motifs, like two-sheet sandwiches and single-sheet barrels in the all-β class. Future versions of SCOP will include the necessary additional levels of classification to make such distinctions explicit.
2.3. Superfamilies
Protein structures classified in the same superfamily are probably related evolutionarily and, therefore, they must share a common fold and usually perform similar functions. If the functional relationship is sufficiently strong, for example, the conserved interaction with substrate or cofactor molecules, the shared fold can be relatively small, provided it includes the active site (for example, Bycroft et al., 1997). This is in contrast to classification on the fold level, which ordinarily requires greater structural similarity.
3. Organization and facilities of SCOP
The SCOP database is available as a set of tightly coupled hypertext pages on the WWW via URL http://scop.mrc-lmb.cam.ac.uk/scop/. The interface to SCOP has been designed to facilitate both detailed searching of particular families and browsing of the whole database. To this end, there are a variety of different techniques for navigation.
3.1. Browsing through the SCOP hierarchy
SCOP is organized as a tree structure. Entering at the top of the hierarchy the user can navigate through the levels of class, fold, superfamily, family and species to the leaves of the tree which are structural domains of individual PDB entries. An alternative hierarchy of folds, superfamilies and families by the date of solution of the first representative structure is also provided.
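The tree organization described above can be illustrated with a minimal sketch. This is not the SCOP implementation; the `Node` class, the helper functions and all entry names (`class-1`, `fold-1`, etc.) are hypothetical placeholders chosen only to show how a path through the named levels leads to a domain leaf.

```python
from dataclasses import dataclass, field

# Levels as named in the text, from the top of the hierarchy to the leaves.
LEVELS = ("class", "fold", "superfamily", "family", "species", "domain")


@dataclass
class Node:
    level: str
    name: str
    children: dict = field(default_factory=dict)


def add_domain(root, path):
    """Insert one domain, given one name per level below the root."""
    if len(path) != len(LEVELS):
        raise ValueError("path must name one node per level")
    node = root
    for level, name in zip(LEVELS, path):
        node = node.children.setdefault(name, Node(level, name))
    return node


def leaves(node):
    """Yield every leaf (structural domain) found below a node."""
    if not node.children:
        yield node
    for child in node.children.values():
        yield from leaves(child)


# Hypothetical entries, for illustration only.
root = Node("root", "SCOP")
add_domain(root, ("class-1", "fold-1", "superfamily-1", "family-1", "species-1", "domain-1"))
add_domain(root, ("class-1", "fold-1", "superfamily-1", "family-2", "species-1", "domain-2"))
add_domain(root, ("class-2", "fold-2", "superfamily-2", "family-3", "species-2", "domain-3"))

# Navigate from the top of the tree to one fold and list its domains.
fold = root.children["class-1"].children["fold-1"]
domain_names = sorted(leaf.name for leaf in leaves(fold))
```

Browsing then amounts to following `children` from one level to the next, while `leaves` collects all domains under any chosen node.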
3.2. From an amino-acid sequence
The sequence similarity search facility allows any sequence of interest to be searched against databases of protein sequences classified in SCOP (see below) using algorithms BLAST (Altschul et al., 1990), FASTA or SSEARCH (Pearson, 1996). SCOP can then be entered from the list of PDB chains found to be similar and the similarity can be displayed visually (see Fig. 2).
3.3. From a keyword
The keyword search facility returns a list of SCOP pages containing the word entered or combinations of words separated by a series of Boolean operators.
3.4. From a PDB identifier
The PDB entry viewer links PDB entries to various graphical views, external databases and SCOP itself.
3.5. By history
Pages are provided that order folds, superfamilies and families by date of entry into PDB or publication. This is both for interest and to make it easier to keep up to date with the appearance of new folds or significant new members of existing folds.
In addition to the information on structural and evolutionary relationships contained within SCOP, each entry (for which coordinates are available) has links to images of the structure, interactive molecular viewers (Fig. 2), the atomic coordinates, data on functional conformational changes, sequence data and homologues and MEDLINE abstracts (see Table 1).
To facilitate rapid and effective access to SCOP, a number of mirrors have been established, a full current list of which can be found via the URL above. The facilities provided by the various sites are always the same, so nothing is lost by accessing the nearest mirror. The implementation does differ: for example, sequence similarity searching is currently always carried out at the main scop.mrc-lmb.cam.ac.uk site. This is transparent to users, who are always returned a search results page marked up with links to pages on the mirror that they started from.
4. Evaluating the effectiveness of sequence-alignment methods
Sequence database searching plays a role in virtually every branch of molecular biology and is crucial for interpreting the sequences issuing forth from genome projects. Despite this, the overall and relative capabilities of different search procedures have until recently been largely unknown. This is because it is difficult to verify algorithms on sample data, as this requires large data sets of proteins whose evolutionary relationships are known unambiguously and independently of the methods being evaluated (nearly all known homologs have been identified by sequence analysis, the method to be tested). Also, it is generally very difficult to know, in the absence of structural data, whether two proteins that lack clear sequence similarity are unrelated. This has meant that although previous evaluations have helped improve sequence comparison, they have suffered from insufficient, imperfectly characterized, or artificial test data (see Brenner et al., 1998).
As part of the maintenance of SCOP, new structures are automatically processed. One of the initial steps is to cluster the sequences of protein chains of known structures at different levels of sequence similarity. This has resulted in a series of non-redundant sequence databases, referred to as PDB40, PDB90, PDB95 (Fig. 3a), where the number refers to percentage sequence identity as modified by the HSSP equation (Sander & Schneider, 1991) and where the chain chosen as the representative is that with the best structural `quality' defined from an equation combining resolution, R factor and PROCHECK values (Laskowski et al., 1993). The final SCOP classification is used to annotate the headers of these FASTA format files and to split them into domains. The result is a set of domain sequence databases, PDB40D, PDB90D etc., where the full set of true and false pairwise relationships between the sequences can be inferred from the scopcode in the headers (Fig. 3b). These databases are used within SCOP for the sequence search facility (see above and Fig. 2), but they are also ideally suited as test data for the calibration of sequence searching algorithms.
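As an illustration of how true and false pairwise relationships might be inferred from classification codes, the following sketch labels every unordered pair of sequences. It rests on two assumptions that are not taken from the headers themselves: that a scopcode can be treated as a dotted string of the form class.fold.superfamily.family, and that a pair counts as `true' when the two sequences share a superfamily. The identifiers `d1` to `d4` and their codes are hypothetical.

```python
from itertools import combinations


def same_superfamily(code_a, code_b):
    """Assumed convention: equal class, fold and superfamily fields."""
    return code_a.split(".")[:3] == code_b.split(".")[:3]


def label_pairs(codes):
    """Map each unordered pair of sequence ids to True (related) or False."""
    return {
        (a, b): same_superfamily(codes[a], codes[b])
        for a, b in combinations(sorted(codes), 2)
    }


# Hypothetical identifiers and codes, for illustration only.
codes = {"d1": "1.1.1.1", "d2": "1.1.1.2", "d3": "1.2.1.1", "d4": "2.1.1.1"}
labels = label_pairs(codes)
n_true = sum(labels.values())
```

Other conventions are possible, for example treating pairs that share a fold but not a superfamily as undetermined rather than false.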
The databases are used for calibration in the following way. Using the algorithm to be tested, an all-against-all search is performed, i.e. each sequence in the database is searched against every other sequence. The entire set of results from all the database searches is then ranked together using the scoring scheme to be evaluated. For a database of 1323 sequences (e.g. PDB40D-B) the ranked list could contain as many as 874 503 distinct pairwise comparisons (1323 × 1322/2), but only 4522 represent true relationships (Brenner et al., 1998). Two cumulative scores are generated by moving a threshold down the list from the best score to the worst: the fraction of the total number of `true' pairwise relationships that lie above the threshold (the coverage) and the fraction of the relationships above the threshold that are false (the error rate, whose complement is the accuracy). Plotting these two values as a coverage/accuracy plot, it is possible to compare the performance of different algorithms and establish the score threshold that relates to a given accuracy (Fig. 4).
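The cumulative scores described above can be sketched as follows. This is a minimal illustration rather than the code used for the published calibration; the function names are hypothetical, scores are assumed to be `higher is better' (E values would need to be negated first), and tied scores are not treated specially.

```python
def pair_count(n):
    """Number of distinct unordered pairs among n sequences."""
    return n * (n - 1) // 2


def coverage_error_curve(hits, n_true_total):
    """Walk a threshold down the ranked list of (score, is_true) hits.

    Returns one (score, coverage, error_rate) point per hit, where
    coverage is the fraction of all true pairs found so far and
    error_rate is the fraction of hits so far that are false.
    """
    ranked = sorted(hits, key=lambda h: h[0], reverse=True)
    curve = []
    true_seen = false_seen = 0
    for score, is_true in ranked:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
        coverage = true_seen / n_true_total
        error_rate = false_seen / (true_seen + false_seen)
        curve.append((score, coverage, error_rate))
    return curve


def deepest_point_within(curve, max_error):
    """Last point in the ranked list whose cumulative error is acceptable."""
    ok = [point for point in curve if point[2] <= max_error]
    return ok[-1] if ok else None


# Toy data: five reported pairs, four true pairs in total (one never found).
hits = [(50, True), (40, True), (30, False), (20, True), (10, False)]
curve = coverage_error_curve(hits, n_true_total=4)
point = deepest_point_within(curve, max_error=0.25)
```

In this toy example `point` is `(20, 0.75, 0.25)`: accepting every hit scoring 20 or better recovers three of the four true pairs at a cumulative error rate of 25%.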
Calibration of the commonly used algorithms BLAST (Altschul et al., 1990), WU-BLAST2 (Altschul & Gish, 1996), FASTA and SSEARCH (Pearson, 1996) revealed three key conclusions that are of practical use for those carrying out sequence database searches (Brenner et al., 1998).
4.1. Algorithm
Given an error limit of 1%, SSEARCH detected the most distant relationships, with FASTA ktup = 1 and WU-BLAST2 being almost as good (Fig. 4). FASTA ktup = 1 is more computationally expensive than BLAST (~4 times slower) and SSEARCH is even more so (~25 times slower than BLAST).
4.2. Scoring
Statistical scoring schemes (P values and E values) produced the best results. Sequence identity was found to be a very poor measure of similarity, with examples of long alignments between unrelated protein structures having high percentage identity (e.g. 39% over 64 residues, 36% over 74 residues and 34% over 85 residues). However, whereas the empirical implementation of E values in FASTA/SSEARCH fairly accurately reflected the true error rate, the analytical implementation of P values in BLAST (Karlin & Altschul, 1990, 1993) overestimated the likelihood of a match being correct by several orders of magnitude. Both E values and P values are based on extreme value distributions; the difference between them is that P values can be thought of as the probability that an alignment is incorrect (i.e. they are corrected for database size), whereas E values represent raw expected errors per query (i.e. they are not corrected for database size).
4.3. Coverage
The coverage of even the best algorithm was remarkably low: only 18% of relationships in the PDB40D database are identified when applying the 1% error-rate threshold with the most sensitive algorithm tested (SSEARCH) and the most discriminating scoring function (E values). Thus, if the procedures assessed here fail to find a reliable match, it does not imply that the sequence is unique; rather, it indicates that any relatives it might have are distant ones.
Knowing the meaning of the score of an alignment has become even more critical in the current era of genome analysis, where there are too many sequence comparisons to evaluate each manually. Applying the results of this calibration it has been possible to evaluate the distribution of families of proteins in whole genomes with confidence (Brenner et al., 1995).
This calibration scheme has also been used to evaluate more sophisticated approaches to sequence searching. Anecdotal evidence has suggested that `intermediate' sequences can be used to link more distantly related proteins, i.e. first carry out a database search against the sequence of interest and then carry out database searches with each sequence returned from the first search. Calibration against PDB40D showed that, using the same algorithm (FASTA), this approach increases the coverage by ~70% when applying the 1% error-rate threshold (Park et al., 1997). Work to evaluate sequence search methods relying on multiple sequence alignments, such as Hidden Markov Models (Eddy, 1996; Krogh et al., 1994) and the recently developed iterative version of BLAST2 (Altschul et al., 1997) (referred to as psi-BLAST), has shown significantly better performance by the same criteria (Park et al., unpublished results; Brenner et al., in preparation).
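The intermediate-sequence idea can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure of Park et al. (1997): `search` stands for any hypothetical function that returns the identifiers of the significant hits for a given sequence (for example, a wrapper around a FASTA run with a chosen threshold), and the toy `graph` below simply stands in for such results.

```python
def intermediate_search(query, search, max_hops=1):
    """Collect direct hits, then hits found by re-searching with each hit.

    `search(seq_id)` must return an iterable of ids judged to be
    significant matches for `seq_id`. With max_hops=1 every sequence
    returned by the first search is used once as an intermediate query.
    """
    found = set(search(query)) - {query}
    frontier = set(found)
    for _ in range(max_hops):
        next_frontier = set()
        for intermediate in frontier:
            for hit in search(intermediate):
                if hit != query and hit not in found:
                    next_frontier.add(hit)
        found |= next_frontier
        frontier = next_frontier
    return found


# Toy stand-in for search results: A finds only B, but B also finds C.
graph = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}


def toy_search(seq_id):
    return graph.get(seq_id, set())


direct = set(toy_search("A"))
linked = intermediate_search("A", toy_search)
```

Here the direct search from `A` finds only `B`, whereas the intermediate search also reaches `C` through `B`. Each additional hop widens the net further, so the error rate would need to be recalibrated for the combined procedure.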
The databases used for these studies are now freely available via the SCOP URL and the format of their headers is shown in Fig. 3.
5. Statistics of protein structural data
With structural data conveniently organized into domains, it is straightforward to investigate the population statistics of the protein structures we currently know. A recent survey of the classification in SCOP (Brenner et al., 1997) clearly shows that even after the high degree of redundancy in PDB has been taken into account, the frequency of occurrence of certain folds is much greater than would be expected by chance, as has been pointed out previously (Orengo et al., 1994). Recalculation of the tables shown there for the most recent version of SCOP (1.37), which contains 20% more domains but only 11% more folds, shows an essentially similar picture.
The raw data to explore the classification in this way can of course be extracted from the SCOP WWW pages (if one likes writing HTML parsers); however, there is an easier way in the form of the flat file shown in Fig. 3(c). This lists all domains classified in SCOP, not just the subset of protein chains which are defined in the headers of the FASTA format files listed above, and can again be accessed from the SCOP URL.
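Population statistics of this kind reduce to counting domains per fold. The sketch below assumes a hypothetical tab-separated layout with a domain identifier followed by a dotted classification code (class.fold.superfamily.family); this layout is an assumption for illustration and is not the format of the flat file in Fig. 3(c), so the parsing step would need to be adapted to the real file.

```python
from collections import Counter


def fold_counts(lines):
    """Count domains per fold from 'domain_id<TAB>code' records."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        domain_id, code = line.split("\t")[:2]
        fold = ".".join(code.split(".")[:2])  # class.fold prefix
        counts[fold] += 1
    return counts


# Hypothetical records, for illustration only.
records = [
    "# domain_id<TAB>code",
    "dom1\t1.1.1.1",
    "dom2\t1.1.2.1",
    "dom3\t1.2.1.1",
    "dom4\t1.1.1.2",
]
counts = fold_counts(records)
most_populated = counts.most_common(1)
```

To allow for redundancy in the PDB, the same count could be run on a non-redundant subset of domains rather than on every entry.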
6. Conclusions
We have found that the easy access to data and images provided by SCOP makes it a powerful general-purpose interface to the PDB (Brenner et al., 1995). The specific lower levels should be helpful for comparing individual structures with their evolutionary and structurally related counterparts. On a more general level, the highest levels of classification provide an excellent overview of the diversity of protein structures now known and would be appropriate both for researchers and students. Having created the classification we have found that it has many other uses, some of which have been listed here. We hope that other researchers will find yet more uses for the raw data files that are now provided with each release.
Acknowledgements
TJPH is grateful to the MRC/DTI/ZENECA LINK programme and AGM is grateful to the MRC for financial support. SEB is grateful for support from a Sloan/DOE fellowship in computational molecular biology.
References
Abola, E., Bernstein, F. C., Bryant, S. H., Koetzle, T. F. & Weng, J. (1987). Crystallographic Databases – Information Content, Software Systems, Scientific Applications, edited by F. H. Allen, G. Bergerhoff & R. Sievers, pp. 107–132. Bonn/Cambridge/Chester: IUCr.
Altschul, S. F. & Gish, W. (1996). Methods Enzymol. 266, 460–480.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). J. Mol. Biol. 215, 403–410.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J. H., Zhang, Z., Miller, W. & Lipman, D. J. (1997). Nucleic Acids Res. 25, 3389–3402.
Appel, R. D., Bairoch, A. & Hochstrasser, D. F. (1994). Trends Biochem. Sci. 19, 258–260.
Benson, D., Lipman, D. J. & Ostell, J. (1993). Nucleic Acids Res. 21, 2963–2965.
Berman, H. M., Olson, W. K., Beveridge, D. L., Westbrook, J., Gelbin, A., Demeny, T., Hsieh, S. H., Srinivasan, A. R. & Schneider, B. (1992). Biophys. J. 63, 751–759.
Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1997). Curr. Opin. Struct. Biol. 7, 369–376.
Brenner, S. E., Chothia, C. & Hubbard, T. J. P. (1998). Proc. Natl Acad. Sci. USA, 95, 6073–6078.
Brenner, S. E., Chothia, C., Hubbard, T. J. P. & Murzin, A. (1995). In Computer Methods for Macromolecular Sequence Analysis, edited by R. F. Doolittle. Orlando: Academic Press.
Brenner, S. E., Hubbard, T. J. P., Murzin, A. & Chothia, C. (1995). Nature (London), 378, 140.
Bycroft, M., Hubbard, T. J. P., Proctor, M., Freund, S. M. V. & Murzin, A. G. (1997). Cell, 88, 235–242.
Eddy, S. R. (1996). Curr. Opin. Struct. Biol. 6, 361–365.
Fitzgerald, P. C. (1994). In WWW94, First International Conference on the World Wide Web, Chemistry Workshop, Elsevier Science BV, CERN, Geneva, Switzerland.
Gerstein, M., Lesk, A. M. & Chothia, C. (1994). Biochemistry, 33, 6739–6749.
Hogue, C., Ohkawa, H. & Bryant, S. H. (1996). Trends Biochem. Sci. 21, 226–229.
Holm, L. & Sander, C. (1994). Nucleic Acids Res. 22, 3600–3609.
Holm, L. & Sander, C. (1995). Trends Biochem. Sci. 20, 478–480.
Holm, L. & Sander, C. (1996). Science, 273, 595–602.
Karlin, S. & Altschul, S. F. (1990). Proc. Natl Acad. Sci. USA, 87, 2264–2268.
Karlin, S. & Altschul, S. F. (1993). Proc. Natl Acad. Sci. USA, 90, 5873–5877.
Krogh, A., Brown, M., Mian, I. S., Sjolander, K. & Haussler, D. J. (1994). J. Mol. Biol. 235, 1501–1531.
Laskowski, R. A., MacArthur, M. W., Moss, D. S. & Thornton, J. M. (1993). J. Appl. Cryst. 26, 283–291.
Moult, J., Bryant, S. H., Fidelis, K., Hubbard, T. J. P. & Pedersen, J. T. (1997). Proteins, S1, 3–6.
Murzin, A. G. (1994). Curr. Opin. Struct. Biol. 4, 441–449.
Murzin, A., Brenner, S. E., Hubbard, T. J. P. & Chothia, C. (1995). J. Mol. Biol. 247, 536–540.
Orengo, C. (1994). Curr. Opin. Struct. Biol. 4, 429–440.
Orengo, C. A., Flores, T. P., Taylor, W. R. & Thornton, J. M. (1993). Protein Eng. 6, 485–500.
Orengo, C. A., Jones, D. T. & Thornton, J. M. (1994). Nature (London), 372, 631–634.
Orengo, C. A., Michie, A. D., Jones, S., Jones, D. T., Swindells, M. B. & Thornton, J. M. (1997). Structure, 5, 1093–1108.
Park, J. H., Teichmann, S. A., Hubbard, T. J. & Chothia, C. (1997). J. Mol. Biol. 273, 349–354.
Pearson, W. R. (1996). Methods Enzymol. 266, 227–258.
Sander, C. & Schneider, R. (1991). Proteins, 9, 56–68.
Sayle, R. A. & Milner-White, E. J. (1995). Trends Biochem. Sci. 20, 374–376.
Sowdhamini, R., Rufino, S. D. & Blundell, T. L. (1996). Folding Des. 1, 209–220.
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited.