research papers
Protein Sequence Alignment Techniques
aEuropean Molecular Biology Laboratory Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, England
*Correspondence e-mail: geoff@ebi.ac.uk
The basic algorithms for alignment of two or more protein sequences are explained. Alternative methods for scoring substitutions and gaps (insertions and deletions) are described, as are global and local alignment methods. Multiple alignment techniques are explained, including methods for profile comparison. A summary is given of programs for the alignment and analysis of protein sequences, either from sequence alone, or from three-dimensional structure.
1. Introduction
Sequence-alignment techniques are central to modern molecular biology. Alignment methods may be used to search sequence databases for potential homologues. Having identified a family of sequences, then alignment methods may be used to generate a multiple sequence alignment that allows the identification of structurally and functionally important residues. Many different programs for pair-wise and multiple sequence alignment exist. In this article the underlying algorithms used by these programs are summarized.
2. Alignment of two sequences
The most commonly used sequence-alignment techniques have three components. The scoring scheme, the gap model and the alignment-optimization algorithm.
2.1. The scoring scheme, substitution or pair-score matrix
The scoring scheme is a 20 × 20 matrix of numbers that defines the value for aligning each of the possible amino-acid pairs. The term substitution is often used for the alignment of two amino-acid residues, since scoring schemes are frequently derived from a model of evolution that considers two protein sequences to be related via a series of point mutations. The pair-score matrix is usually symmetrical, since Ala aligned with Gly has the same meaning as Gly aligned with Ala. The simplest scoring scheme is the identity matrix. This scores 1 for an exact match of two amino acids, and 0 for a mismatch. Although the identity matrix is appealing in its simplicity, it does not reflect adequately similarities observed between proteins that have similar three-dimensional structures. More sophisticated schemes take into account conservative substitutions. For example, Val aligned with Leu might score +4, but Glu with Leu, −3. Until recently, matrices referred to as PAM or Dayhoff were the most widely used. PAM matrices were derived by first aligning a small number of families of protein sequences by eye, then counting the observed amino-acid substitutions within the families and normalizing the counts before extrapolating the observed substitutions to those expected at different evolutionary distances (Dayhoff et al., 1978). The measure of evolutionary distance used was the percentage of accepted mutations, or PAM, and the most commonly applied matrix was that at 250 PAMS, normally known as PAM250.
In spite of its small training-set size, the PAM250 matrix captures the principal physico-chemical properties of the amino acids (Taylor, 1986a). Furthermore, updates to the PAM matrix obtained from much larger data sets, for example, the PET92 matrix (Jones et al., 1992), show few differences to PAM250, except for substitutions with the less common amino acids such as tryptophan.
The BLOSUM series of matrices (Henikoff & Henikoff, 1993) are also derived from an analysis of observed substitutions in protein families. Unlike PAM matrices, the starting point for BLOSUM is a set of alignments without gaps, obtained by the BLAST (Altschul et al., 1990) algorithm (Henikoff & Henikoff, 1992). The alignments include sequences that share much lower sequence similarity than those used in the Dayhoff studies. In extensive tests of sequence database searching (Henikoff & Henikoff, 1992), pair-wise alignment (Vogt et al., 1995) and multiple sequence alignment (Raghava & Barton, 1998) the BLOSUM series of matrices on average give results superior to the PAM matrices and most other matrices. For this reason, BLOSUM matrices are now the general matrix of choice for protein sequence alignment, and are the default matrices used by most popular sequence-alignment and database-searching software.
Rather than starting from alignments generated by sequence comparison, Overington et al. (1992) only consider proteins for which an experimentally determined three-dimensional structure is available. They then align similar proteins on the basis of their structure rather than sequence and use the resulting sequence alignments as their database from which to gather substitution statistics. In principle, the Overington matrices should give more reliable results than either PAM or BLOSUM. However, the comparatively small number of available protein structures currently limits the reliability of their statistics. Overington et al. (1992) develop further matrices that consider the local environment of the amino acids.
2.2. Dealing with gaps
Insertions and deletions are observed within protein families, and it is normally necessary to introduce such indels when producing an alignment. The simplest scheme for gaps introduces a new character that scores u when aligned with any amino acid. Since gaps are comparatively rare, u is usually made negative. As a consequence, u is often referred to as the gap-penalty. In this simple scheme, a gap of ten residues is penalized ten times more highly than a gap of one residue. Within protein families, this makes little sense since gaps of more than one residue are needed to obtain structurally reasonable alignments. The most commonly used scoring scheme for gaps is a function of the form: ul + v, where l is the length of the gap in residues. This form of penalty function is referred to as affine and has efficiency advantages over more elaborate penalty functions (Gotoh, 1982). The constants v and u are often referred to as the penalties for creation and extension of the gap, or length-independent and length-dependent penalties, respectively. Gap penalties need not be uniform across the sequence and such position specific gap penalties are discussed in §2.4.
2.3. Finding the optimal alignment of two sequences
The optimal alignment of two protein sequences is the alignment that maximizes the sum of pair-scores less any penalty for introduced gaps. The problem is to find an efficient way of locating an alignment that satisfies these conditions. If no gaps are allowed, then the task is simple, just slide one sequence over the other and for each position, sum the pair-scores from the chosen matrix (e.g. BLOSUM62). This simple alignment algorithm requires of the order of 200 alternative alignments to be evaluated for two sequences of length 100. In contrast, if gaps are allowed, then there are >1075 alternative alignments that must be considered! An elegant solution to this intractable search problem is given by a class of computer algorithm known as dynamic programming. Dynamic programming allows the optimal alignment of two sequences to be found in of the order of mn steps, where m and n are the lengths of the sequences. Dynamic programming for sequence comparison was independently invented in several fields, many of which are discussed in Sankoff and Kruskal's book (Sankoff & Kruskal, 1983). An introduction to dynamic programming in the wider context of string comparison can be found in Gusfield (1997). Needleman & Wunsch (1970) are often attributed as the first application of dynamic programming in molecular biology, while slightly different formulations of the same algorithm were described by Sellers (1974) and Waterman et al. (1976). Here, the basic Sellers (Sellers, 1974) dynamic programming algorithm for two sequences is described, with a length-dependent gap-penalty. A simple example of this algorithm is illustrated and explained in Fig. 1. At each aligned position between two sequences , and of length m and n, there are three possible events,
where is the symbol for a single residue gap. The substitution score (wAi, Bj) is taken from the amino-acid pair-score matrix such as BLOSUM62 (Henikoff & Henikoff, 1992). The two deletion scores are the penalty for aligning a single residue with a single gap. The total score M for the alignment of A with B can be represented as . This is found by working along each protein sequence from the N to the C terminus, successively finding the best score for aligning with for all i, j where 1 and . The values of are stored in a matrix H where each element of H is calculated as follows:
Once the matrix is complete, the element Hm,n contains the total score for the alignment of the complete sequences.
The processing steps described above determine the best score possible for the alignment of the two sequences given the gap-penalty and pair-score matrix. However, they do not give the alignment. In order to generate an alignment of A and B that gives the best score, it is necessary to trace back through H. An efficient way of completing the trace-back is to save an array of pointers that records which of the three possibilities was the maximum at each Hi,j. It is possible for there to be ties for the maximum chosen at any Hi,j. Ties represent equally valid alternative alignments and the manner in which ties are resolved is dependent on the program implementation. For this reason, two computer programs that correctly claim to implement the same dynamic programming algorithm and give the same total score for the alignment of two sequences, may produce subtly different alignments.
Calculation of the best score is more complex if gap-penalties of the form ul + v are employed since it is necessary to determine for each cell whether a gap is starting at that cell, or continuing from an earlier cell. This algorithm typically runs a factor of three slower than the simple algorithm.
2.4. Position-specific gap-penalties: domains and secondary structure
In the simple example in the previous section, the penalty for a gap is equal at all locations in the alignment. However, it often makes sense to penalize gaps differently at the ends, or at different positions within each sequence. For example, if a protein domain is being aligned to a longer sequence that is known to contain the domain, the penalties at the end of the domain should be reduced to allow the domain to slide over the longer sequence. If the secondary structure of one protein in a pair to be aligned is known, then increasing the gap-penalty within core secondary-structure elements will reduce the likelihood of placing a gap in a secondary structure (Barton & Sternberg, 1987a; Lesk et al., 1986). Both changes require simple modifications to the algorithm. End gaps are adjusted by changing the gap-penalty constants for the 0th and last row and column of the H matrix. Position-specific gaps are set by having a vector of penalties P of length m rather than a single constant . This modifies the calculation of Hi,j to
In this example, the gap-vector P refers to sequence A. Thus, the weight for aligning any residue in A with a gap will depend on where the residue is in A. In contrast, aligning a residue in B with a gap is penalized equally irrespective of position.
There are many ways of modifying position-specific gap-penalties. For example P can be applied to gaps in both sequences, but dependent only on the position in A, so eliminating the fixed constant , or a second gap-penalty vector can be introduced for B.
2.5. Position-specific weights: profile comparison
In the examples in the previous sections, residue substitution weights have all come from a single pair-score matrix such as BLOSUM62. However, the substitution weights (wAi,Bj) can be made position-specific in the same way as gap-penalties. If the weights are position specific relative to A it means that the weight for matching a residue from A of a particular type to any other residue will depend upon the location of the residue in A as well as the type of the residue in B. For example, a Gly at A32 aligning with a Gly in B may have a weight of +7, while a Gly at A76 aligned with a Gly in B has a weight of −5. Position-specific weights are useful since they allow the importance of specific substitutions to be emphasized at particular positions along the sequence. This might mean increasing the weight for aligning a known active-site residue in A with a residue of the same type, or more general properties such as increasing the weight for known buried hydrophobic residues to align with similar residues in B.
In order to calculate an alignment score by dynamic programming that includes position-specific weights, a position-specific weight matrix or profile Qm,20 for sequence A must be defined. Q contains m rows where each row has 20 weights for substitutions with each amino-acid type at that position. Thus, Q3,S is the substitution weight for a serine in B with position 3 of A. The equivalent of having no position-specific weights is to populate each with the appropriate row from the BLOSUM62 or other pair-score matrix. However, if the power of position-specific weights is to be exploited, additional information about A must be included in the derivation of Q. This might be knowledge of the three-dimensional structure of the protein with sequence A, where position-specific weights reflect the local environment of the amino acids (e.g. Overington et al., 1992). Or more commonly, the frequencies of observed amino acids at each position in a multiple sequence alignment of sequences similar to A (Gribskov et al., 1987; Barton & Sternberg, 1987b, 1990). Hi,j when calculated with position-specific weights and gaps for A is modified to
Since their parallel development by many authors in the mid 1980's (Gribskov et al., 1987; Barton & Sternberg, 1987b; Taylor, 1986b), position-specific weighting schemes, or profiles, have formed the basis of many methods for sensitive sequence comparison. In addition, the principle of position-specific weights is at the heart of currently popular techniques such as Hidden Markov Models (HMM's) (Krogh et al., 1994) and generalized profiles (Bucher et al., 1996). A generalization of position-specific weight matrices is to create two Q matrices, one for each sequence. When calculating each element of H, the substitution weights may be combined by averaging. Comparison of two profiles is fundamental to hierarchical multiple alignment algorithms discussed in §3.3.
2.6. Local alignment algorithm
The general algorithm described in §2.3 is a global alignment algorithm. If it is not known in advance that the sequences are similar over most of their length, then a local alignment algorithm is preferred. To convert the basic algorithm to a local algorithm for a pair-score matrix that is centred on zero, the 0th row and column of H are initialized to zero, then the calculation of each element of H is modified to
The only difference is the addition of a zero (Smith & Waterman, 1981) which has the effect of terminating any path in the H matrix whose score drops below zero. The maximum local alignment score for comparison of A and B is given by the maximum value in H. The alignment is traced back from this cell until cells on the path drop to zero. There may be more than one local alignment between two sequences, perhaps representing repeats, or multiple shared domains. Algorithms exist to find most (Barton, 1993a) or all (Waterman & Eggert, 1987) of the alternative local alignments.
2.7. Alternative alignments for the same two sequences
The possibility of ties when constructing the H matrix, and hence alternative equally valid alignments has already been discussed in §2.3. A more general problem is that there may be many alternative alignments with scores close to the optimum. Any one of the alternative alignments could be the `true' biologically meaningful alignment and so it is useful to be able to generate all alignments `close' to the optimal alignment.
Saqi & Sternberg (1991) determine alternative sub-optimal alignments by first calculating the H matrix and best path. They then identify the cells that contributed to the best path and reduce these by a preset value (normally 10% of the typical scoring matrix value). A new H matrix is then calculated and a new best path and alignment. This process is repeated iteratively to generate a series of global sub-optimal alignments. Zuker (1991) and Vingron & Argos (1990) describe methods to visualize alternative sub-optimal alignments on a dot-plot. A simple way to probe for stable parts of an alignment is to generate an alignment with standard parameters, then modify the gap penalties slightly, re-align and observe which regions of the alignment change (if any).
2.8. Aligning sequences in linear space
The standard dynamic programming algorithm requires storage of at least one m × n matrix in order to calculate the alignment. On current computers, this is not a problem for protein sequences, but for large DNA sequences, or complete genomes, space requirements can be prohibitive. Linear-space algorithms for dynamic programming (Myers & Miller, 1988) overcome this problem by a recursive strategy, albeit at some sacrifice in execution time.
2.9. Comparing sequences without alignment
Few generally available programs allow the flexible investigation of alternative alignments. A way of avoiding the problem is to make use of one of the earliest methods of comparing two sequences, the dot-plot, or diagonal plot (Gibbs & McIntyre, 1970; Staden, 1982). Although the dot-plot does not give an alignment directly, it is an effective technique to view simultaneously all similarities between two sequences. Similarities may be obvious in a dot-plot, but missed entirely by a dynamic programming sequence alignment program that only displays the top-scoring alignment.
For two sequences A and B of length m and n, respectively, a matrix Dm,n is generated where each element Di,j represents the similarity between sequence segments centred on Ai and Bj. In its simplest form, Di,j = 1 if Ai = Bj and Di,j = 0 if . A graph is plotted with A and B on each axis and a dot plotted whenever . Regions of sequence similarity appear as diagonal lines on the plot and repeats as parallel lines. Insertions and deletions show up as steps in the diagonals as the diagonal moves from one diagonal to another. The most sophisticated dot-plot techniques calculate Di,j from a sliding window, and score similarities by a pair-score matrix such as BLOSUM62 or PAM250. Dotter (Sonnhammer & Durbin, 1995) provides a particularly rich set of features including an interactive dot-plot `contrast' adjustment that simplifies the interpretation of plots.
3. Multiple sequence alignment
3.1. Extending dynamic programming to more than two sequences
A multiple sequence alignment is an alignment that contains three or more sequences. In theory, dynamic programming methods for two sequences can be generalized to N sequences by adding dimensions to the H matrix. For three sequences H becomes a cube, and so on. In practice, although software has been developed that will allow three sequences to be aligned by full dynamic programming (e.g. Murata et al., 1995), aligning more than three is impractical both in terms of CPU time and memory.
A common technique to save time and space when calculating the global alignment of two sequences is to `cut corners'. Instead of calculating the entire H matrix, only a window either side of the main diagonal is calculated. This general principle has been extended to the global alignment of multiple sequences (Lipman et al., 1989). The program MSA that implements these ideas reliably performs simultaneous alignment of small numbers of sequences (typically <10).
3.2. Genetic algorithms
As an alternative to dynamic programming, the ). It is shown that alignments as mathematically optimal as those generated by MSA (Lipman et al., 1989) may be produced.
has been applied to multiple sequence alignment (Notredame & Higgins, 19963.3. Hierarchical methods of multiple sequence alignment
Hierarchical methods for multiple sequence alignment are by far the most commonly applied technique since they are fast and accurate. Hierarchical methods proceed in three steps. A schematic example of the stages in hierarchical multiple alignment is illustrated for seven globin sequences in Fig. 2.
For N sequences, all N(N-1)/2 unique pair-wise comparisons are made and the similarity scores for the comparisons recorded in a table. The similarity scores, may be simple percentage identities, or more sophisticated measures. For example, the SD score is calculated by first aligning the pair of sequences by dynamic programming and saving the raw score V for the alignment. The two sequences are then shuffled and aligned, usually at least 100 times. The score for each shuffled pair comparison is saved and the mean and standard deviation of the scores calculated. The SD score is then given by . SD scores correct for the length and composition of the sequences, and so are preferable to raw alignment scores, or percentage identity.
An example table of pair-wise SD scores is illustrated at the top left of Fig. 2. Hierarchical is then performed on the table of pair-wise scores. This process generates a dendrogram or tree that groups the most similar pair, the next most similar pair and so on. The tree is illustrated at the bottom left of Fig. 2. Finally, the multiple sequence alignment is generated by following the dendrogram from its leaves to the root. The detailed stages in this process are described in the legend to Fig. 2.
One often-stated drawback to hierarchical methods is that gaps once introduced are fixed. Thus, an error made at an early stage in the alignment process will propagate. Some methods allow a second pass through the alignment where each sequence is re-aligned to the complete alignment in order to correct the worst of these errors (Barton & Sternberg, 1987b; Gotoh, 1996).
The main differences between different hierarchical methods rest with the techniques used to score the initial pair-wise alignments, the hierarchical clustering method and the treatment of gaps in the multiple alignment. For example, pair-wise alignments may be scored by simple percentage identity, normalized alignment score (where the raw score for alignment of two sequences is divided by the length of the alignment) or a statistical SD score from Monte Carlo shuffling of the sequences. Clustering may be a simple `add one sequence at a time' method (Barton & Sternberg, 1987b), or single linkage (Barton, 1990), or neighbour joining (Thompson et al., 1994), though any clustering method may be applied. Gaps in the multiple-alignment process may be weighted simply, or position dependent (Thompson et al., 1994). One of the most sophisticated hierarchical multiple-alignment programs is CLUSTALW (Thompson et al., 1994). CLUSTALW applies different pair-score matrices when aligning sequences of differing similarity. The program also modifies gap-penalties in a position-specific fashion according to analyses of gap preferences in protein sequences (Pascarella & Argos, 1992).
4. How well do alignment methods work?
The pair-wise sequence-alignment algorithms outlined in the previous sections are guaranteed to produce a mathematically optimal alignment for two sequences given the chosen pair-score matrix (e.g. BLOSUM62) and gap-penalty function. However, the fact that an alignment is optimal for the matrix and penalty does not mean that the alignment is biologically or structurally meaningful. One can take any two sequences and optimally align them, but an optimal alignment of, say, a globin (an all-α protein) and an (all-β) is meaningless.
The standard most commonly applied is to test how well the method reproduces the alignment obtained from comparison of the protein three-dimensional structures. In recent tests against structural alignments of 693 test families of domains (Raghava & Barton, 1998), most popular hierarchical methods for multiple sequence alignment aligned over 85% of core residues in agreement with the structural alignment. Alignment accuracy is correlated with sequence similarity (Barton & Sternberg, 1987b) and for families with average pair-wise sequence identities of over 30%, the sequence alignments were over 90% in agreement with the reference alignment in the core. In contrast, for those families with 0–10% sequence identity, the average agreement in the core was <25%.
5. Software for sequence and structure comparison and analysis
The basic algorithms for pair-wise and multiple sequence alignment have been described. Here, programs from the author's group that implement some of these algorithms are briefly summarized. All software is available via the URL http://barton.ebi.ac.uk.
5.1. SCANPS – fast local alignment database searching
SCANPS implements local alignment methods (Smith & Waterman, 1981; Barton, 1993a) for protein and nucleic acid sequence database searching. The program has been made to run efficiently in parallel on Silicon Graphics hardware. A WWW server that allows searching of the SWALL non-redundant database of protein sequences is available at the EBI on http://www2.ebi.ac.uk/scanps/. Alternative software is the SSEARCH program that is distributed in the FASTA package from W. Pearson (ftp://ftp.virginia.edu).
5.2. AMPS – pair-wise and multiple alignment of sequences
The AMPS package of programs implements pair-wise global alignment with assessment of statistical significance by Monte Carlo methods. It also implements position-specific weights and a variety of profile-based searching and alignment methods. However, the main function of AMPS is to produce multiple sequence alignments by a hierarchical method. Most features of the package are reviewed by Barton (1990). CLUSTALW (Thompson et al., 1994) is an alternative to AMPS for multiple alignment. The program is available from many ftp sites.
5.3. STAMP – multiple alignment from protein three-dimensional structure comparison
STAMP (structural alignment of multiple proteins) (Russell & Barton, 1992) produces sequence alignments from comparison of protein three-dimensional structures. The algorithm is similar to that described for hierarchical multiple-sequence alignment, but STAMP works from the co-ordinates of Cα atoms rather than amino-acid characters. The program allows two or more structures to be compared, and a gapped pair-wise or multiple alignment and superposition are produced. The quality of each aligned position is assigned a value that allows the reliable regions of the structural alignment to be identified. STAMP also allows searching of a structure against a database of structures and has a number of utilities to simplify the manipulation of protein coordinate files and the preparation of publication-quality figures.
5.4. AMAS – analysis of multiply aligned sequences
AMAS (Livingstone & Barton, 1993, 1996) assists in the identification of functionally important residues from large protein-sequence alignments. A by-product of this process is a coloured and annotated representation of the alignment. An AMAS WWW server is available on http://barton.ebi.ac.uk/servers/amas_server.html.
5.5. ALSCRIPT – annotation and colouring of a multiple sequence alignment
ALSCRIPT (Barton, 1993b) allows flexible annotation of a sequence alignment. Annotations include the display of helix and strand symbols as well as alignment characters in any font and font size. Output is as a PostScript file that can be printed or viewed.
5.6. OC – general hierarchical cluster analysis
The AMPS package includes a simple clustering program, ORDER, to determine a tree for multiple alignment. ORDER is limited at compile time in the number of items it can cluster. In contrast, OC is limited only by the computer resources. OC also implements a wider range of clustering methods that may be applied to any data for hierarchical clustering.
References
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). J. Mol. Biol. 215, 403–410. CrossRef CAS PubMed Web of Science
Barton, G. J. (1990). Methods Enzymol. 183, 403–428. CrossRef CAS PubMed Web of Science
Barton, G. J. (1993a). Comput. Appl. Biosci. 9, 729–734. CAS PubMed Web of Science
Barton, G. J. (1993b). Protein Eng. 6, 37–40. CrossRef CAS PubMed Web of Science
Barton, G. J. & Sternberg, M. J. E. (1987a). Protein Eng. 1, 89–94. CrossRef CAS PubMed Web of Science
Barton, G. J. & Sternberg, M. J. E. (1987b). J. Mol. Biol. 198, 327–337. CrossRef CAS PubMed Web of Science
Barton, G. J. & Sternberg, M. J. E. (1990). J. Mol. Biol. 212, 389–402. CrossRef CAS PubMed Web of Science
Bucher, P., Karplus, K., Moeri, N. & Hofmann, K. (1996). Comput. Chem. 20(1), 3–23. CrossRef Web of Science
Dayhoff, M. O., Schwartz, R. M. & Orcutt, B. C. (1978). A Model of Evolutionary Change in Proteins. Matrices for Detecting Distant Relationships. In Atlas of Protein Sequence and Structure, Vol. 5, edited by M. O. Dayhoff, pp. 345–358. Washington DC: National Biomedical Research Foundation.
Gibbs, A. J. & McIntyre, G. A. (1970). Eur. J. Biochem. 16, 1–11. CrossRef CAS PubMed Web of Science
Gotoh, O. (1982). J. Mol. Biol. 162, 705–708. CrossRef CAS PubMed Web of Science
Gotoh, O. (1996). J. Mol. Biol. 264(4), 823–838. CrossRef Web of Science
Gribskov, M., McLachlan, A. D. & Eisenberg, D. (1987). Proc. Natl Acad. Sci. USA, 84, 4355–4358. CrossRef CAS PubMed Web of Science
Gusfield, D. (1997). Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.
Henikoff, S. & Henikoff, J. G. (1992). Proc. Natl Acad. Sci. USA, 89, 10915–10919. CrossRef PubMed CAS Web of Science
Henikoff, S. & Henikoff, J. G. (1993). Proteins Struct. Funct. Genet. 17, 49–61. CrossRef CAS PubMed Web of Science
Jones, D. T., Taylor, W. R. & Thornton, J. M. (1992). Comput. Appl. Biosci. 8, 275–82. PubMed CAS Web of Science
Krogh, A., Brown, M., Mian, I. S., Sjolander, K. & Haussler, D. (1994). J. Mol. Biol. 235, 1501–1531. CrossRef CAS PubMed Web of Science
Lesk, A. M., Levitt, M. & Chothia, C. (1986). Protein Eng. 1, 77–78. CrossRef CAS PubMed Web of Science
Lipman, D. J., Altschul, S. F. & Kececioglu, J. (1989). Proc. Natl Acad. Sci. USA, 86, 4412–4415. CrossRef CAS PubMed Web of Science
Livingstone, C. D. & Barton, G. J. (1993). Comput. Appl. Biosci. 9, 745–756. CAS PubMed Web of Science
Livingstone, C. D. & Barton, G. J. (1996). Methods Enzymol. 266, 497–512. CrossRef CAS PubMed Web of Science
Murata, M., Richardson, J. S. & Sussman, J. L. (1985). Proc. Natl Acad. Sci. USA, 82, 3073–3077. CrossRef CAS PubMed Web of Science
Myers, E. W. & Miller, W. (1988). Comput. Appl. Biosci. 4, 11–17. CAS PubMed Web of Science
Needleman, S. B. & Wunsch, C. D. (1970). J. Mol. Biol. 48, 443–453. CrossRef CAS PubMed Web of Science
Notredame, C. & Higgins, D. G. (1996). Nucleic Acid. Res. 24(8), 1515–1524. CrossRef Web of Science
Overington, J., Donnelly, D., Johnson, M. S., Sali, A. & Blundell, T. L. (1992). Protein Sci. 1, 216–226. CrossRef PubMed CAS Web of Science
Pascarella, S. & Argos, P. (1992). J. Mol. Biol. 224, 461–471. CrossRef PubMed CAS Web of Science
Raghava, G. P. S. & Barton, G. J. (1998). Protein Sci. Submitted.
Russell, R. B. & Barton, G. J. (1992). Proteins Struct. Funct. Genet. 14, 309–323. CrossRef PubMed CAS Web of Science
Sankoff, D. & Kruskal, J. B. (1983). Editors. Time Warps, String Edits and Macromolecules: the Theory and Practice of Sequence Comparison. New York: Addison Wesley.
Saqi, M. A. S. & Sternberg, M. J. E. (1991). J. Mol. Biol. 219, 727–732. CrossRef PubMed CAS Web of Science
Sellers, P. H. (1974). SIAM J. Appl. Math. 26, 787–793. CrossRef Web of Science
Smith, T. F. & Waterman, M. S. (1981). J. Mol. Biol. 147, 195–197. CrossRef CAS PubMed Web of Science
Sonnhammer, E. L. L. & Durbin, R. (1995). Gene-Combis. 167, 1–10.
Staden, R. (1982). Nucleic Acid. Res. 10(9), 2951–2961. CrossRef Web of Science
Taylor, W. R. (1986a). J. Theor. Biol. 119, 205–218. CrossRef CAS PubMed Web of Science
Taylor, W. R. (1986b). J. Mol. Biol. 188, 233–258. CrossRef CAS PubMed Web of Science
Thompson, J. D., Higgins, D. G. & Gibson, T. J. (1994). Nucleic Acid. Res. 22, 4673–4680. CrossRef CAS PubMed Web of Science
Vingron, M. & Argos, P. (1990). Protein Eng. 3, 565–569. CrossRef CAS PubMed Web of Science
Vogt, G., Etzold, T. & Argos, P. (1995). J. Mol. Biol. 249, 816–831. CrossRef CAS PubMed Web of Science
Waterman, M. S. & Eggert, M. (1987). J. Mol. Biol. 197, 723–728. CrossRef CAS PubMed Web of Science
Waterman, M. S., Smith, T. F. & Beyer, W. A. (1976). Adv. Math. 20, 367–387. CrossRef Web of Science
Zuker, M. (1991). J. Mol. Biol. 221, 403–420. CrossRef PubMed CAS Web of Science
© International Union of Crystallography. Prior permission is not required to reproduce short quotations, tables and figures from this article, provided the original authors and source are cited. For more information, click here.