5. Align and Superpose


Overview

The Protein Design Alignment palette provides options and tools for aligning, matching, and superposing proteins. Sequence alignment can be done automatically or the alignment can be edited manually. There are tools to aid alignment: dot plots, alignment constraints and graphical indications of homology. The alignment and matching of homologous sequences can be based on a variety of sequence and structural criteria.

S. B. Needleman and C. D. Wunsch, "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins," Journal of Molecular Biology, 48, 443 (1970).

M. O. Dayhoff, Atlas of Protein Sequence and Structure (National Biomedical Research Foundation, Silver Spring, Md., 1978), 5, supplement 3.

D. F. Feng and R. F. Doolittle, "Progressive Sequence Alignment as a Prerequisite to Correct Phylogenetic Trees," Journal of Molecular Evolution, 25, 351-360 (1987).


Aligning and Superposing Sequences

A variety of options are available for aligning sequences, matching residues, and superposing structures. A general discussion describing the different algorithms and options used for these tools follows.


Using Active Sequences and Active Ranges

All the tools on this palette are applied only to active sequences. To align, match, or superpose a limited set of the sequences or molecules, change the sequence activity. Sequence activity is indicated on the Sequence Viewer by graying out the names of inactive sequences. The activity can be changed either by picking the sequence name on the Sequence Viewer or by picking the A icon at the bottom left of the Sequence Viewer, which brings up a dialog box for changing the activity of multiple sequences more efficiently. If the sequence is also an MSF, its activity can be changed in the Molecule Management Table.

Automatic alignment, manual editing of the alignment and application of Undo All and Undo Last can be applied only to residues within the active range. This can be particularly useful when manually editing small regions of the alignment: the Active Range tool can be used to ensure that the rest of the alignment is unchanged by insertion and deletion of gaps in the active range. In order to maintain alignment of residues to the right of the active range, gaps may be inserted or deleted at the right hand end of the active range.

The match tools are also applied only to residues within the active range. The active range can be set by picking the Set Active Range tool on the Protein Utilities palette and is indicated on the Sequence Viewer by triple red lines showing its limits. The thicker, innermost line is the actual limit.


Criteria for Aligning and Matching Sequences

Sequence alignment algorithms attempt to align pairs of residues which are similar. The algorithms require some quantitative measure of similarity. Conventional sequence alignment uses an amino acid substitution matrix which has been derived from analysis of amino acid substitutions observed in families of proteins through the course of evolution. It is possible to use other criteria; particularly if the structure of the protein is known, there are criteria for aligning the residues in the equivalent positions and environments in the structure.

The same criteria which are used to optimize the alignment can also be used to indicate the degree of homology between proteins. The match tools in this utility identify the homologous residues or ranges of residues.

In this utility there are five possible scoring schemes for alignment and matching, and you can use weighted combinations of these schemes. The default scoring is a conventional sequence homology scoring system.1 As an example of a weighted combination, 50% sequence similarity, 30% secondary structure similarity and 20% Cα-Cα distance criteria is useful for recognizing homologous structures.
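The weighted combination of scoring schemes can be pictured as a weighted sum of the individual scores for a pair of aligned residues. The following is a minimal sketch only: the function name, the criterion names and the example scores are hypothetical, and the assumption that the schemes are combined by a simple weighted sum is not confirmed by this manual.

```python
def combined_score(scores, weights):
    """Weighted sum of per-criterion scores for one pair of aligned residues.

    `scores` and `weights` are dicts keyed by criterion name. The names
    are illustrative only and are not QUANTA identifiers.
    """
    for name, weight in weights.items():
        # The manual states that all weights should lie in 0.0 to 1.0.
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight for {name!r} must lie in 0.0 to 1.0")
    return sum(weight * scores.get(name, 0.0) for name, weight in weights.items())


# The 50/30/20 mix suggested for recognizing homologous structures.
weights = {"sequence": 0.5, "secondary_structure": 0.3, "ca_distance": 0.2}
# Hypothetical scores for one residue pair under each scheme.
pair_scores = {"sequence": 4.0, "secondary_structure": 10.0, "ca_distance": 8.0}
total = combined_score(pair_scores, weights)  # close to 6.6
```

A criterion with weight zero drops out of the sum, which corresponds to leaving that scheme unused.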

The sequence homology scoring system uses the conventional Dayhoff amino acid substitution matrix which is based on the probability of replacing one amino acid type by another as observed in the evolution of families of proteins.2

The secondary structure homology scoring system scores favorably for aligning residues of similar secondary structure and penalizes aligning non-similar secondary structure.3

The accessibility scoring uses the residue fractional solvent accessibility. The score is linearly dependent on the difference in the fractional accessibility. There is a maximum score of 10.0 for no difference in accessibility. The score decreases linearly to zero for a cutoff difference of 0.3. The maximum score and cutoff can be changed using the Alignment Scores tool.

The environment class of a residue is that defined by the method of Luthy, Bowie and Eisenberg4 as used in Profile Analysis and is based on the solvent accessibility, polarity of environment and secondary structure of the residue. Before using the environment class as a criterion for alignment, it should be calculated for all the relevant structures using the Plot Structure Profile tool in the Profile Analysis application.5

The Cα-Cα distance homology scoring system is based on the interatomic distances between the Cα atoms of aligned residues in different sequences. This scoring system is only applicable after structures are superposed. The score for aligning a pair of residues is linearly dependent on the Cα-Cα distance: there is a maximum score of 10.0 for a distance of zero, and the score decreases linearly to zero for a cutoff distance of 5.0 Å. The maximum score and cutoff can be changed using the Alignment Scores tool.
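Both the accessibility score and the Cα-Cα distance score are described as a linear ramp from a maximum score at zero difference down to zero at a cutoff. A minimal sketch of such a ramp is given below. The function name is hypothetical, and holding the score at zero beyond the cutoff is an assumption, since the text only describes the behavior up to the cutoff.

```python
def linear_score(difference, max_score=10.0, cutoff=5.0):
    """Score falling linearly from max_score at 0 to 0.0 at the cutoff.

    Assumption: the score stays at 0.0 for differences beyond the cutoff.
    """
    if cutoff <= 0:
        raise ValueError("cutoff must be positive")
    d = abs(difference)
    if d >= cutoff:
        return 0.0
    return max_score * (1.0 - d / cutoff)


# Distance score for two aligned residues 2.5 A apart, using the
# defaults quoted in the text (maximum 10.0, cutoff 5.0 A).
distance_score = linear_score(2.5)

# Accessibility score for a fractional accessibility difference of 0.15,
# using the cutoff of 0.3 quoted in the text.
accessibility_score = linear_score(0.15, max_score=10.0, cutoff=0.3)
```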


Alignment

The conventional pair-wise sequence alignment method described by Needleman and Wunsch6 aligns two sequences to maximize the alignment score. The alignment score is the sum of the scores for all pairs of aligned residues, minus an optional penalty for the introduction of gaps (automatic insertions and deletions) into the alignment. If there are more than two sequences to be aligned then they are aligned chronologically in pairwise fashion.3
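The idea of maximizing the sum of pair scores minus gap penalties can be illustrated with a textbook dynamic-programming version of the Needleman and Wunsch method. This is a sketch under simplifying assumptions, not the QUANTA implementation: it uses a single fixed penalty per gap position (end gaps included), and a toy identity score stands in for the substitution matrix. All names are hypothetical.

```python
def needleman_wunsch_score(seq_a, seq_b, pair_score, gap_penalty=1.0):
    """Best global alignment score for two sequences.

    Assumption: every gap position, including those at the ends,
    costs the same fixed gap_penalty.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    best = [[0.0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        best[i][0] = -gap_penalty * i
    for j in range(1, cols):
        best[0][j] = -gap_penalty * j
    for i in range(1, rows):
        for j in range(1, cols):
            best[i][j] = max(
                # Align residue i of seq_a with residue j of seq_b.
                best[i - 1][j - 1] + pair_score(seq_a[i - 1], seq_b[j - 1]),
                # Gap in seq_b opposite residue i of seq_a.
                best[i - 1][j] - gap_penalty,
                # Gap in seq_a opposite residue j of seq_b.
                best[i][j - 1] - gap_penalty,
            )
    return best[-1][-1]


def identity_score(res_a, res_b):
    """Toy pair score standing in for a substitution matrix."""
    return 1.0 if res_a == res_b else -1.0


score = needleman_wunsch_score("ACDE", "ADE", identity_score)
```

In this example the best alignment places one gap opposite the C, giving three matches minus one gap penalty.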

To align more than two sequences, an alignment is performed for all pairwise combinations of active sequences, and the alignment score, which indicates the degree of homology of the two sequences, is calculated. The normalized alignment score is also calculated by multiplying the score by 100 and dividing by the number of residues in the shorter of the two sequences.
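The normalization described here is simple arithmetic and can be written out as a short function. The function name and the example values are hypothetical.

```python
def normalized_alignment_score(score, length_a, length_b):
    """Alignment score times 100, divided by the shorter sequence length."""
    shorter = min(length_a, length_b)
    if shorter <= 0:
        raise ValueError("both sequences must contain at least one residue")
    return 100.0 * score / shorter


# A raw score of 45.0 for sequences of 120 and 90 residues.
normalized = normalized_alignment_score(45.0, 120, 90)
```

Dividing by the shorter length makes scores comparable between pairs of sequences of different sizes.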

The normalized alignment score for each pair of sequences is reported in the textport and plotted as a dendrogram.7 This dendrogram indicates the relationships and the order in which pairwise alignment is used to align multiple sequences. Sequences that join at the leftmost node in the dendrogram have the highest normalized alignment score and are therefore the most similar sequences, so they are aligned first. After a pair of sequences is aligned, the two are kept fixed with respect to each other and aligned against more dissimilar sequences.
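One way to picture how an alignment order follows from the pairwise scores is a simple agglomerative clustering: the most similar pair is joined first, and the joined group is then compared with the remaining sequences. The sketch below assumes average linkage between groups; the clustering method actually used by QUANTA is not stated in this manual, and all names are hypothetical.

```python
from itertools import combinations


def mean_pair_score(cluster_a, cluster_b, score):
    """Average normalized score over all sequence pairs between two clusters."""
    values = [score[frozenset((x, y))] for x in cluster_a for y in cluster_b]
    return sum(values) / len(values)


def alignment_order(names, score):
    """Return the merges, most similar first, as (cluster_a, cluster_b, score).

    Assumption: average linkage is used to score groups of sequences.
    """
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Join the two clusters with the highest mean pairwise score.
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: mean_pair_score(pair[0], pair[1], score),
        )
        merges.append((a, b, mean_pair_score(a, b, score)))
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(a + b)
    return merges


# Hypothetical normalized alignment scores for three sequences.
pair_scores = {
    frozenset(("A", "B")): 80.0,
    frozenset(("A", "C")): 30.0,
    frozenset(("B", "C")): 40.0,
}
order = alignment_order(["A", "B", "C"], pair_scores)
```

Here A and B are joined first because they have the highest score, and C is then aligned against the fixed A-B pair.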

As the iterative alignment procedure is performed the alignment scores are reported in the textport. The Sequence Viewer is updated, showing the new alignment.

Gap penalties weight against the alignment algorithm introducing insertions and deletions into the alignment. Alignment can be significantly affected by the size of the gap penalties used, particularly in cases of low homology.

There are three forms of gap penalty used in QUANTA: a penalty for opening a gap, a penalty for extending a gap, and a penalty for mismatching ends.

The effect of the first two forms is that a large penalty weighs against opening a gap and there is a smaller additional penalty applied for extending the gap. The penalty for mismatching ends is, by default, lower.
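The combined effect of the opening and extension penalties can be sketched as a function of gap length. This is an illustration only: the penalty values are hypothetical, and the assumption that the extension penalty is charged for every gap position after the first is not confirmed by this manual.

```python
def gap_penalty(length, open_penalty, extend_penalty):
    """Total penalty for one gap of the given length.

    Assumption: the first gap position costs open_penalty and each
    further position costs extend_penalty.
    """
    if length <= 0:
        return 0.0
    return open_penalty + extend_penalty * (length - 1)


# Hypothetical values: one long gap is cheaper than several short ones.
one_gap_of_four = gap_penalty(4, open_penalty=10.0, extend_penalty=1.0)
four_gaps_of_one = 4 * gap_penalty(1, open_penalty=10.0, extend_penalty=1.0)
```

With a large opening penalty and a small extension penalty, the algorithm is discouraged from scattering many short gaps through the alignment.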

The default opening penalty is a fixed value, but there is an alternative penalty scheme which is dependent on secondary structure. Opening a gap in the middle of a secondary structure element, helix or strand, is heavily penalized, while openings at the end of the element are less heavily penalized. This tends to force insertions and deletions into loop regions, which is where they are observed most commonly in practice.

If an initial alignment does not produce the expected results then it may be worthwhile to experiment with the gap penalties whose values can be changed using the Options tool.


Manual Alignment Editing

There are several tools to allow manual adjustment of the alignment. These are particularly useful when used in conjunction with the Match Residues option "Update match when alignment changed": as the alignment is changed, the match bars in the sequence table and/or the match score plotted in the Sequence Viewer give feedback on the quality of the alignment.

The two basic manual alignment editing methods are addition and removal of gaps which allow shifting of sequences by one residue position at a time. To make bigger shifts to a sequence the Align Two Residues tool can be used. Moving entire sequences can be done with the click and drag facility in the Sequence Viewer.

By default, adding and removing gaps will cause a readjustment of the alignment for all the positions to the right of the edited position. The changes can be limited by setting an active range using the Set Active Range tool on the Protein Utilities palette. Changes to the alignment will not be propagated outside of the active range. Gaps may be inserted or deleted at the right hand end of the active range in order to maintain alignment of residues to the right of the active range.


Saving and Restoring Alignments

The Undo Last and Undo All tools will allow backtracking after automatic or manual alignment. Alignments can also be saved to file and restored later by using the Sequence Data utility on the Files pulldown. Use the Write Alignment File option to save the alignment and the Read Alignment/Sequence File option to restore the alignment. The Restore Alignment Only option should be selected so the sequences are not read into QUANTA again. Any file format can be used but the Clustal format is similar to that used by QUANTA to save alignments between sessions. Note that if you do not want to restore the alignment for all of the sequences then some sequences can be made inactive and the For Active Sequences Only option should be checked.


Dot Plots

Dot plots show a comparison between two sequences and can provide useful feedback on the quality of an alignment and suggest alternative alignments which might be tested. The x axis of a dot plot is the residue position of the first sequence and the y axis is the residue position of the second sequence. The value shown at the position x,y in the plot is a comparison score for residue x of sequence 1 to residue y of sequence 2. Various parameters can be scored and plotted: the default is the amino acid similarity score as given by the Dayhoff comparison matrix. It is conventional to show normalized rather than absolute scores on dot plots. To do this, the mean and standard deviation of the scores for all the points on the dot plot are calculated and then only the relatively high scores are shown in terms of their number of standard deviations above the average score.
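The normalization of dot plot scores can be sketched as converting each score to a number of standard deviations above the mean and then keeping only the points above a display cutoff. The names below are hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import mean, pstdev


def normalize_scores(scores):
    """Express each score as standard deviations above the mean.

    `scores[x][y]` is the comparison score for residue x of sequence 1
    and residue y of sequence 2.
    Assumption: the population standard deviation is used.
    """
    flat = [value for row in scores for value in row]
    average = mean(flat)
    spread = pstdev(flat)
    if spread == 0:
        # All scores identical: nothing stands out from the average.
        return [[0.0 for _ in row] for row in scores]
    return [[(value - average) / spread for value in row] for row in scores]


def points_above(normalized, cutoff):
    """Positions (x, y) whose normalized score reaches the cutoff."""
    return [
        (x, y)
        for x, row in enumerate(normalized)
        for y, value in enumerate(row)
        if value >= cutoff
    ]


raw = [[1.0, 1.0], [1.0, 5.0]]
z = normalize_scores(raw)
shown = points_above(z, cutoff=1.0)
```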

Also shown on the dot plot is the path of the current alignment. This is the blue line with small blue dots showing where residue x is aligned to residue y. For two similar sequences the alignment path will run roughly diagonally across the plot from bottom left to top right. Where there are gaps in the alignment, the alignment path will not run parallel to this leading diagonal and there will be a relatively long gap between the small dots indicating aligned residues.

Dot plots are usually drawn to show the comparison of a range of residues rather than single residues. If you are unfamiliar with dot plots then try drawing a dot plot for two short similar sequences. Before doing so, however, set the dot plot window to one and switch off the normalization of dot plot scores (use the Options tool to access the Dot Plot Options dialog box). This shows, for the purposes of comparison, the score of every individual residue in sequence one against every individual residue in sequence two; the resulting checkerboard effect is very difficult to interpret.

Now try dot plots with window lengths of three, five and eleven. Using longer window lengths, you will see diagonal lines appear on the plot and a strong band along the leading diagonal if the two sequences are significantly similar. If the two sequences are aligned automatically, then the alignment path shown on the plot would be expected to overlay the strong diagonal lines.

In calculating a dot plot with window length eleven, the sum of the scores for comparison of eleven consecutive residues in both sequences one and two is assigned to the position on the plot corresponding to the sixth residue in the comparison window of each sequence. The average and standard deviation of the scores for all points on the plot are then calculated. Where the comparison score is above the cutoff, a dot is drawn on the dot plot for each pair of residues in the comparison windows. This gives the diagonal lines which you see on the dot plot. Since each residue contributes to multiple comparison windows, it may contribute to more than one comparison score that is above the display cutoff; when this happens the position on the dot plot is colored to indicate the larger of the comparison scores. Overlap of comparison windows with good scores may also give diagonal lines on the plot which are longer than the window length.
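The window calculation can be sketched in two steps: sum the pair scores along each window and assign the sum to the window centers, then draw a dot for every residue pair in each window whose sum exceeds the cutoff. The sketch assumes an odd window length, applies the cutoff to raw rather than normalized sums for brevity, and uses a toy identity score. All names are hypothetical.

```python
def window_scores(seq_x, seq_y, pair_score, window=11):
    """Sum pair scores over each window, keyed by the window centers.

    Assumption: window is odd, so the center is well defined. With
    window=11 the center is the sixth residue of each window.
    """
    half = window // 2
    totals = {}
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            total = sum(pair_score(seq_x[i + k], seq_y[j + k]) for k in range(window))
            totals[(i + half, j + half)] = total
    return totals


def dots_to_draw(totals, window, cutoff):
    """Mark every residue pair of each window scoring above the cutoff.

    Where windows overlap, the larger window score is kept.
    """
    half = window // 2
    dots = {}
    for (center_x, center_y), total in totals.items():
        if total <= cutoff:
            continue
        for k in range(-half, half + 1):
            position = (center_x + k, center_y + k)
            dots[position] = max(total, dots.get(position, total))
    return dots


def identity_score(res_a, res_b):
    """Toy pair score standing in for the comparison matrix."""
    return 1.0 if res_a == res_b else 0.0


sequence = "ACDEF"
totals = window_scores(sequence, sequence, identity_score, window=3)
dots = dots_to_draw(totals, window=3, cutoff=2.0)
```

In this example three overlapping windows of length three produce a diagonal line of five dots, which shows how overlap gives lines longer than the window length.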

Dot plots for similar sequences show a strong diagonal trace roughly along the leading diagonal and after automatic alignment the blue line showing the alignment path will follow this trace. If there are other strong traces close to the leading diagonal then they indicate possible alternative alignment paths which can be explored using constraints.

Dot plots can be drawn for a limited region of two sequences by using the Select Active Range tool on the Protein Utilities palette. This can be useful in analyzing a region of low homology. Using a shorter window length will also be useful in this situation.


Alignment Constraints

Alignment constraints enable you to bias the automatic alignment algorithm to align the constrained residues. Constraints might be needed if there is experimental evidence for alignment of certain residues or if you want to explore non-optimal alignments as suggested by the dot plot. When constraints are used in an alignment, a large favorable score is assigned to aligning the constrained residues. This does not absolutely guarantee that the algorithm will align the constrained residues - the penalties incurred by aligning inappropriate residues and the gap penalties may outweigh the constraint weighting. To enforce the constraints, it is possible to increase the constraint weighting, but it is probably better to also assign constraints to neighboring residues.
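The way a constraint biases the alignment without guaranteeing it can be sketched as a bonus added to the score of the constrained residue pair. The default weight of 100 is taken from the constraint weighting option described later in this chapter; adding the bonus to the ordinary pair score, rather than replacing it, is an assumption, and all names are hypothetical.

```python
def with_constraints(pair_score, constraints, weight=100.0):
    """Wrap a pair score so constrained positions receive a large bonus.

    `constraints` is a set of (position_in_seq_1, position_in_seq_2).
    Assumption: the bonus is added to the ordinary pair score.
    """
    def score(i, j, res_a, res_b):
        bonus = weight if (i, j) in constraints else 0.0
        return pair_score(res_a, res_b) + bonus
    return score


def identity_score(res_a, res_b):
    """Toy pair score standing in for the substitution matrix."""
    return 1.0 if res_a == res_b else -1.0


score = with_constraints(identity_score, {(2, 5)})
constrained = score(2, 5, "A", "G")    # mismatch, but constrained
unconstrained = score(0, 0, "A", "A")  # ordinary match
```

Because the bonus is finite, a path through the constrained pair can still lose to another path if it needs enough gap penalties or poor pairings to get there, which is why constraining neighboring residues as well can help.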


Matching Residues

Matched residues are aligned residues (i.e., residues in the same column of the sequence viewer) which are homologous. The degree of homology may be determined by a variety of criteria:

Matched residues are usually indicated on the sequence viewer by a vertical yellow line. The appearance of this line can be controlled in the Sequence Viewer Options tool (accessed through the O icon at the bottom left of the Sequence Viewer). An alternative means of display is to plot the match score as a graph in the Sequence Viewer. An option to update the matched residues whenever the sequence alignment is changed is on by default.

The match scores can be analyzed to give pairwise comparison scores between all the active sequences, and these scores can be analyzed to generate a dendrogram of the family relationship between all of the sequences, based on whichever criterion is currently being used for the match analysis.

You may also select the matches manually, an option which is useful when the matched residues are to be used as the selection criteria in another function such as in the Copy Matched Residues tool in the Create Homology Model tool.

The match score is calculated on a column-by-column basis, using the current alignment of the active sequences. Alternatively, the score can be averaged over several columns around the column under consideration. This averaging of match scores is useful for identifying homologous regions rather than just similar individual residues. The match window length is controlled by the Match Options tool.
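Averaging the match score over a window of columns can be sketched as a moving average. The handling of columns near the ends of the alignment, where the window is simply truncated, is an assumption, and the names are hypothetical.

```python
def smoothed_match_scores(column_scores, window=1):
    """Average each column's match score over a window centered on it.

    Assumption: near the ends of the alignment the window is truncated
    and the average is taken over the columns that exist.
    """
    half = window // 2
    smoothed = []
    for i in range(len(column_scores)):
        start = max(0, i - half)
        stop = min(len(column_scores), i + half + 1)
        smoothed.append(sum(column_scores[start:stop]) / (stop - start))
    return smoothed


# Hypothetical per-column scores: two isolated high-scoring columns.
per_column = [0.0, 10.0, 0.0, 0.0, 10.0]
averaged = smoothed_match_scores(per_column, window=3)
```

With a window of one the scores are unchanged, while longer windows spread isolated high scores over their neighbors and so favor extended homologous regions.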

Another tool which provides information to help assess alignments is the RMS Deviation tool. There is an option to plot a graph in the sequence viewer of the distance between equivalent atoms in aligned residues.


Color by Homology

The homology between sequences is indicated on the Sequence Viewer by vertical yellow bands and can be shown on two molecule structures by dashed lines between matched residues, but this latter presentation is not easily interpretable for more than two structures. Coloring the structure residues according to their homology is useful in this case and can be activated by the Color by Homology option in the Molecule Color tool on the Protein Utilities palette. The coloring ranges used by this tool can be changed using the Color by Homology option accessed via the Options tool.


Superposing Structures

Structure superposition overlays atoms within the matched residues of the active structures, using a least squares algorithm. By default only the Cα atoms are superposed but alternative selections are available using the Superposition Options under the Options tool.

To superpose multiple molecules, there are several cycles of superposition.8 In the initialization cycle, each of the other molecules is superposed onto a target molecule (by default this is the first selected molecule). For subsequent cycles, a template, which is an average of all molecules, is calculated, and each molecule is superposed onto the template. For each cycle, the root mean square (rms) difference in atomic coordinates between each molecule and the target template is reported. After each cycle, a new average template is calculated, and the rms difference in coordinates between this template and the template from the previous cycle is reported. If only two molecules are being superposed, the rms difference reported is one half the rms difference between the two molecules.
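The statement that the reported rms difference for two molecules is one half of the rms difference between them follows from the template being the average of the two coordinate sets. The sketch below shows only this arithmetic: it does not perform the least squares fitting, the coordinates are hypothetical and are assumed to be already superposed, and all names are illustrative.

```python
import math


def rms_difference(coords_a, coords_b):
    """Root mean square distance between equivalent atoms."""
    total = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for a, b in zip(atom_a, atom_b)
    )
    return math.sqrt(total / len(coords_a))


def average_template(molecules):
    """Average the coordinates of equivalent atoms over all molecules."""
    count = len(molecules)
    return [
        tuple(sum(molecule[i][k] for molecule in molecules) / count for k in range(3))
        for i in range(len(molecules[0]))
    ]


# Two hypothetical two-atom molecules, assumed already superposed.
molecule_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
molecule_b = [(0.0, 2.0, 0.0), (1.0, 2.0, 0.0)]

template = average_template([molecule_a, molecule_b])
between_molecules = rms_difference(molecule_a, molecule_b)
to_template = rms_difference(molecule_a, template)
```

Each template atom lies midway between the two equivalent atoms, so every atom is half as far from the template as from its partner in the other molecule.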

If the RMS difference in template coordinates between cycles is less than 0.1Å, then the refinement is terminated; otherwise, it is terminated after 10 cycles. If you have opted to output the transformation matrix (see under the Options tool), then the translation vector and rotation matrix that have been applied to the coordinates of the molecule in order to bring it to the final superposed position are reported.

After the structures are superposed, the interatomic distances between the Cα atoms can be used as a criterion in alignment, and this can be a useful means of refining the alignment to reflect structural homology.


Tools and Options

This tool aligns all the currently active sequences. If the Select Active Range tool has been used then the alignment will only be applied to the active range. The default alignment criterion is to align similar residue types, but the Alignment Weights tool can be used to change the criteria. When there are only two active molecules, the sequences are immediately aligned.

To align more than two sequences, the usual protocol is to align all possible pairwise combinations of sequences and calculate an alignment score. Cluster analysis of these scores determines the family relationship between the sequences, which is represented by a dendrogram. The default protocol then aligns all the sequences in an order determined from the dendrogram. As an alternative to this default protocol, you may stop after generating the dendrogram or you may select two sets of sequences to align. A dialog box presents you with these options when you align more than two sequences:

The options for alignment are:

This tool displays the Alignment Weights dialog box, which allows you to choose a weighting scheme for using a combination of the different homology criteria. All weights should be in the range 0.0 to 1.0.

This option displays the Score Parameters dialog. This dialog allows you to specify score parameters, cutoffs and change the align score file.

(default 100) The Constraint tool is used to select residues which will be pulled into alignment by the automatic alignment. The weighting of the constraint can be changed through this option.

(default 10.0) and

(default 5.0). These parameters affect the Cα-Cα distance homology scoring. The maximum score is given for a distance of zero and the score decreases linearly to zero for the cutoff distance.

(default 10.0) and

(default 5.0) These parameters affect the accessibility homology scoring. The maximum score is given for an accessibility difference of zero and the score decreases linearly to zero for the cutoff difference in residue accessibility.

By default the residue type scoring scheme is taken from the file $HYD_LIB/protein_align_score.dat which contains the Dayhoff substitution scoring matrix. An alternative file name can be entered here. Note that the file should have the same format as the default file.

This tool undoes the last sequence alignment or alignment edit.

This tool removes all gaps from the active sequences. This only applies within the Active Range if it is on.

When this tool is active, you can pick residues (on the Sequence Viewer or active molecules) and add a gap before that residue. The tool remains highlighted and active until it is deselected.

When this tool is active, you can pick a gap on the Sequence Viewer to delete it. The tool remains highlighted and active until it is deselected.

This tool aligns two residues from the sequences of two different active molecules. You are prompted with the Pick Residue palette and should then select two residues on the Sequence Viewer or molecules. The leftmost of the two residues will be moved into line with the rightmost.

This tool calculates and displays a dot plot for two sequences. If more than two sequences are currently selected then you are prompted to select just two. If the Active Range is on, then the dot plot is drawn for only the active range.

This tool brings up a dialog box with the option to choose the match criteria and also with options to control the mode of action. These are:

Undo all display of matches.

A residue selection palette allows you to manually select matched residues.

While this option is on the Match Residues tool on the palette will remain highlighted and the matched residues will be recalculated every time the alignment is changed.

The match scores are plotted in the Sequence Viewer. This option can be used in conjunction with the previous one to give updates of the plot as the alignment is changed.

Plot a dendrogram based on the pairwise inter-sequence match scores.

This tool toggles a single match on or off by picking the residue position on the sequence table or the molecule.

This tool displays the Match Option dialog box. You can change the different match variables and cutoffs.

This tool superposes the matched residues of the active molecules. If a target molecule has not been selected, then the first active molecule is used.

This tool saves the superposed molecule coordinates to their respective MSFs. It activates the standard MSF saving options.

This tool rereads the last saved version of the active molecules (MSFs) and restores the coordinates. This rejects any superposed coordinates that were not saved.

This tool activates the Align and Superpose Options dialog box from which five additional options can be selected.

Atoms to Superpose: By default only the Cα atoms are superposed but alternatives are to superpose all main chain atoms or for you to enter a selection.

Choose target molecule: During the superposition one molecule will remain stationary. By default the first selected molecule is this target molecule.

Output transformation matrix: If this option is checked then after each superposition the rotation matrix and translation vector applied to each molecule are listed to the textport.

Move all atoms in molecule: By default this option is checked on. If it is switched off then you will be given the atom selection palette in order to select the atoms which will move during the superposition.

This option presents a dialog box which allows you to change the penalties assigned to creating a gap in automatic alignment. The different forms of penalty function are discussed above. The dialog box has options for you to select which forms are active and to change the penalty value for each form. There is also an option to change the maximum gap length. By default the alignment algorithm will not test alignments that involve inserting gaps greater than 40% of the sequence length (this limitation reduces calculation time), but if you are working with some exceptional sequences you may wish to change this.

Number of residues in window: By default dot plots are drawn for a single window length of 11 residues. You can change this value to anything from 1 up to large values such as 31. Analysis for more than one window length can be presented on the same plot. Multiple window lengths can be entered in the text input line; the values should be separated by spaces.

Show constraints on dot plot: By default any constraints between residues of the two plotted sequences are shown on the dot plot.

Normalize dot plot scores: By default the coloring of dot plots uses the normalized scores where the normalization has been done over the entire dot plot. If this option is not checked then the dot plot will be colored according to the absolute scores.

This displays the Color Range dialog box from which you can edit the dot plot colors and cutoffs.

The coloring of molecules and sequences is controlled by the Molecule Color tool on the Protein Utilities palette. One option is to color by the homology between sequences. The colors and cutoff values used in this coloring scheme can be changed in this dialog box.

The RMS deviations between the currently active structures are calculated and listed to the textport. By default a single RMS figure per pair of molecules is listed. If the active range is on then this tool is applied to only the residues in the active range. There are several options:

To calculate an rms deviation for only a limited set of residues you should ensure those residues are matched using either the Match Residues or Change Match tool and then check this option.

An rms deviation per residue is listed to the textport.

The rms per residue is plotted to the sequence viewer.

By default the reported rms is for just the Cα atoms but alternative selections are available.

This tool is available if there is a dendrogram plot currently displayed. The dendrogram will be written to a PostScript format file which can be used to create a hardcopy plot.

This tool exits the palette. If structures have been superposed but the coordinates have not been saved, you will be prompted to save them.


The Constraints Palette

The constraints palette is activated when the Constraint tool on the Align and Superpose palette is picked. The palette is closed by repicking the Constraint tool or picking the Exit Constraints tool from the Constraint palette. The palette has tools to enable selection of constraints, to save and restore constraints in external files and to toggle on or off the use of constraints in alignment.

To define a constraint you must select one residue per sequence for two or more sequences. Constraints are shown on the sequence viewer as a thin blue line between the residues. Beware that this line might be obscured by the Match indicator if it is active. If the Add Constraint tool is active then the last picked residue is indicated on the sequence viewer by a blue triangle under the residue. Constraints are also shown on dot plots by a blue circle about the position corresponding to two constrained residues.

By default once a constraint is selected or read in it will be used in any subsequent automatic alignment but the constraints can be excluded from the alignment by deactivating the Use in Auto Align tool. If not all sequences are active when an automatic alignment is performed then only the constraints with residues in two or more active sequences will be used. Constraints are not saved automatically between QUANTA sessions so any constraints required in future should be saved to file.


Constraint Palette Tools

If this tool is active then constraints can be selected by picking the appropriate residues in the sequence viewer or on the molecules. The selected residues are indicated by pale blue boxes. Only one residue per sequence should be selected; if a second residue is selected from the same sequence then there is a warning message and the option to use either the previous or the new residue. Once a residue has been selected from all the currently active sequences then the constraint is considered to be completely defined and is saved and the next residue pick is considered as the start of a new constraint. It is also possible to define a constraint between two sequences by picking a point on a dot plot for those two sequences. It may be helpful to increase the scale of the dot plot by using the full screen icon at the top right of the dot plot window or by using the Zoom Window tool on the dot plot pull down menu under Display.

If a mistake is made in selecting residues for constraints then this tool should be used to restart the selection for the last constraint.

This tool should be used after all the required residues have been selected for the current constraint. The constraint does not need to have a residue selected for every sequence but should have a residue for at least two sequences. The next residue you pick will start the definition of the next constraint.

If this tool is active then picking any one residue in a constraint will delete that constraint. Constraints can also be selected by picking the dot plot.

Deletes all constraints.

The constraints will only be used if this tool is active. If the Constraint palette is closed, the status of this tool is retained for all subsequent alignments.

List all current constraints to the textport. The information is organized with each constraint on one line and all the constrained residues in each sequence in a column under the sequence name. A * character indicates that the constraint does not apply to that sequence.

Save the data to a file with the default extension .con. The information is organized with each constraint on one line and all the constrained residues in each sequence in a column. The sequence names are given at the top of the file.

A constraint file with default extension .con is opened and the constraints read. If the name of a sequence in the file does not correspond to any currently selected sequence then the information for that sequence will be ignored but so long as the constraint still has two or more residues in currently selected sequences it is read in. If sequence or MSF names have been changed since the constraint file was written it is possible to edit the file to update the names.

Close the Constraint palette. Note that if constraints are selected and the Use in Auto Align tool is active then the constraints will be used in any future alignment.

1M. O. Dayhoff, Atlas of Protein Sequence and Structure (National Biomedical Research Foundation, Silver Spring, Md., 1978), 5, supplement 3.
2These scores are stored in the file $HYD_LIB/protein_align_score.dat.
3These scores are stored in the file $HYD_LIB/protein_align_score.dat.
4R. Luthy, J.U. Bowie & D. Eisenberg "Assessment of protein models with 3D profiles" Nature 356, 83-85 (1992)
5The scoring schemes are stored in the file $HYD_LIB/protein_align_score.dat.
6S. B. Needleman and C. D. Wunsch, "A General Method Applicable to the Search for Similarities in the Amino Acid Sequence of Two Proteins", Journal of Molecular Biology, 48, 443 (1970).
7A dendrogram is a plot showing the family tree of three or more sequences and is based on scores from pairwise comparisons of sequences done either by the Align Sequences or the Match Residues tools. A dendrogram plot will be produced automatically if three or more sequences are aligned or if the Dendogram option is checked in the Match Residues tool. The plot resembles a family tree diagram, with the most similar sequences connected by the shortest branches.
8This follows the method of Sutcliffe et al (M.J. Sutcliffe, I. Haneef, D. Carney and T.L. Blundell, Protein Engineering 1, 377-384 [1987]).

© 2006 Accelrys Software Inc.