2. Reading and Writing Sequence Data Files

Overview

This chapter describes tools for importing and exporting sequence data. It covers:

Reading and writing sequence data files

Demo of user data

Reading and Writing Sequence Data Files

The following options are accessed from the Sequence item on the File pulldown menu. These tools import and export sequence-only data, that is, data with no associated atomic data. To generate an MSF with a full atomic representation of a sequence, use the Create MSF from Sequence tool in the Protein Editor utility. There are also options to import sequence-related data generated outside QUANTA for display in the Sequence Viewer, and to output the Sequence Viewer as a Postscript file for printing.

Read Sequence/Alignment File

This tool reads individual sequences in FASTA, EMBL/Swissprot or GCG (Wisconsin) format. These formats are described in appendix F. The tool also reads sequences from alignment files in QUANTA alignment, GCG Pileup, GCG Pairs, or Clustal format.
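
As an illustration of the simplest of these formats, the following Python sketch parses FASTA text into name and sequence pairs. It is not part of QUANTA and does not show how QUANTA reads files. The function name read_fasta and the example sequences are made up for this illustration. Appendix F remains the reference for the formats QUANTA accepts.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs.

    Hypothetical helper for illustration only; it is not a QUANTA function.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up example data, not taken from the demonstration files.
example = """>first example entry
MISLIAALAV
DRVIGMENAM
>second
ACDE
"""
records = read_fasta(example)
```

Each header line starts a new record, and all following lines up to the next header are joined into one sequence string.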

If the Restore Alignment option is selected, then only the sequence alignment, and not the sequences themselves, will be read from the file and used to reset the alignment sequences within the viewer. If the For Active Sequences Only option is also picked, the alignment will be restored only for sequences which are currently active within QUANTA.

Note that the usual file extension for a Pileup file is msf (multiple sequence file). To avoid confusion with the QUANTA MSF file, the file librarian uses a default Pileup file extension of pup. Either rename your Pileup files to use this extension or enter the required extension in the file librarian.
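
If you have many Pileup files, giving them the pup extension can be scripted. The following Python sketch is one possible approach and is not part of QUANTA. The function name copy_to_pup is hypothetical. It copies rather than renames, so the original files are left untouched, and it assumes that you pass it only Pileup files and never QUANTA MSF files.

```python
import shutil
from pathlib import Path


def copy_to_pup(pileup_files):
    """Copy each named Pileup file to a sibling file with a .pup extension.

    Hypothetical helper; the caller must pass only GCG Pileup files.
    Returns the list of newly created paths.
    """
    copies = []
    for name in pileup_files:
        source = Path(name)
        target = source.with_suffix(".pup")
        if target.exists():
            # Do not overwrite an existing file silently.
            continue
        shutil.copyfile(source, target)
        copies.append(target)
    return copies
```

For example, copy_to_pup(["dfr.msf"]) would create dfr.pup alongside dfr.msf, provided dfr.pup does not already exist.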

Read Sequence Data File

This tool reads data created outside of QUANTA and displays it in the Sequence Viewer graphs or uses it for coloring sequences. The data must be in either binary or ascii QUANTA sequence data format. These formats and how to generate them are described in detail in appendix A of this manual.

The file can contain multiple sets of data. Each data set has a label and the name of the sequence which the data applies to. The data sets associated with a sequence can either be used to color the sequence or can be plotted as a graph which is kept in synchronization with the associated sequence alignment.

The sequence data normally maps one datum per residue of a sequence. It is also possible to have data sets that are not associated with a sequence and that map one datum to each column in the Sequence Viewer, but such data sets can only be displayed as graphs.

If either the Plot Graph or the Color Sequence option is checked, the data selection tools are presented. Two points to note:

Within the input sequence data file it is possible to indicate which data sets should be plotted by default, so that the user is presented with the most useful default selection.

To color sequences using the sequence data files, it is also necessary to have a seq_user_color.dat file, which defines the color mapping for the sequence data. This file is described in more detail in appendix A.

Demo of User Data

To see how Sequence Data Import works you might like to try it with some demonstration files. This demonstration will read data output from the PHD package, which is a secondary structure prediction server at EMBL, into QUANTA and display the data within the Sequence Viewer.

The demo files in $QNT_ROOT/user_group_files/sequence_data are for the sequence of a dihydrofolate reductase (dfr) whose structure is known from crystallography, although this server would normally be used to make predictions for sequences whose structure is not known.

The output from PHD is in the file dfr.phd and includes:

The result of a sequence database search for homologous sequences. These sequences are aligned and written in the GCG Pileup format as part of the dfr.phd file. This alignment has been edited out from dfr.phd into the file dfr.pup.

A secondary structure prediction analysis that reports a three-state (helix, extended or loop) propensity. That is, for each residue there is a probability value in the range 0-10 of it adopting each of the three states. This data will be plotted on a graph within QUANTA.

For each residue in the sequence, a prediction of the most probable secondary structure type, which may be helix, extended, loop or none of these. Within QUANTA the sequence will be colored to show the prediction.

For each residue in the sequence, a prediction of the most likely structural environment: exposed or buried. These predictions can be displayed in QUANTA by coloring the sequence.

To read this data into QUANTA, it must first be converted to QUANTA sequence data format by a short program described in appendix A. The resulting file, dfr.sqdat, contains data sets which are labeled HELIX, EXTED, LOOP (the secondary structure propensities), SECSTR (the secondary structure prediction) and ACCESS (the predicted solvent accessibility). All five sets of data are associated with the sequence predict_h274, which is the original input dfr sequence.

There is also a seq_user_color.dat file which defines a suitable color mapping for the SECSTR and HPDACC data which can be used to color the sequence according to its predicted secondary structure or accessibility.

To run the demo:

1. Copy these files from $QNT_ROOT/user_group_files/sequence_data to your working directory:

seq_user_color.dat
dfr.pup
dfr.sqdat

2. Use the Read Sequence/Alignment File option to read the Pileup format file dfr.pup.

3. Use the Read Sequence Data File option to read the ascii data file dfr.sqdat. Select both the Plot Graph and Color Sequences options.

4. In the Color Residues According to Data dialog box, select the SECSTR data set to color the sequence according to its final predicted secondary structure.

5. Finally, to see the accessibility prediction, use Read Sequence Data File. The original file name should still be selected by default. Make sure the Color Residues button is active and when presented with the Color Residues According to Data dialog box, select HPDACC as the data set.
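
Step 1 of the demo can also be scripted. The following Python sketch is one way to copy the three demo files and is not part of QUANTA. The function name copy_demo_files is hypothetical. The directory layout and file names are those given in step 1, and the value of QNT_ROOT is passed in by the caller.

```python
import shutil
from pathlib import Path

# File names taken from step 1 of the demo.
DEMO_FILES = ("seq_user_color.dat", "dfr.pup", "dfr.sqdat")


def copy_demo_files(qnt_root, workdir="."):
    """Copy the demo files from the QUANTA installation to workdir.

    Hypothetical helper; qnt_root is the value of $QNT_ROOT.
    Returns the list of copied paths.
    """
    source_dir = Path(qnt_root) / "user_group_files" / "sequence_data"
    target_dir = Path(workdir)
    copied = []
    for name in DEMO_FILES:
        source = source_dir / name
        if not source.is_file():
            raise FileNotFoundError(f"demo file not found: {source}")
        shutil.copy(source, target_dir / name)
        copied.append(target_dir / name)
    return copied
```

A typical call would be copy_demo_files(os.environ["QNT_ROOT"]) after importing os, run from your working directory.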

Reference

Thanks to Burkhard Rost¹ at the EMBL for allowing us to use the PHD server output for this demonstration.

The PHD server is at:

http://www.embl-heidelberg.de/predictprotein/predictprotein.html

Write Sequence File

This tool writes each currently active sequence to a separate file in EMBL, FASTA, PIR or GCG format. By default the filenames are derived automatically from the sequence name and the default file extension for that format. If a file of that name already exists then you will be warned and given the option to overwrite it or give an alternative name.

Plot Sequence Viewer

This tool produces a file in QUANTA plot format or Idraw Postscript format for printing the Sequence Viewer. The latter format can be used directly for printing or can be read into Idraw for editing. There is a check box to choose a color plot (currently implemented only for the Postscript format), and the output colors should closely match the current QUANTA colors. Adjusting the QUANTA colors using the Color dials will therefore adjust the Postscript colors.

By default, all currently displayed sequences and graphs are plotted, and the entire range of each sequence is drawn. However, if an active range is currently selected, only that range is drawn. Since only short sequences will fit across a page, the display of the viewer must usually be wrapped around. If the plot extends over more than one page, each page is written to a separate file, and the files are given the names name_0n.ps or name_0n.qpt, where name is the filename that you entered and n is the page number. If files with these names already exist, you will be warned.
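
When collecting the output of a multi-page plot, it can help to list the file names to expect. The following Python sketch applies the name_0n pattern described above and is not part of QUANTA. The function name page_filenames is hypothetical. It assumes that pages are numbered from 1 and, because the naming of page 10 and beyond is not stated here, it handles only plots of up to nine pages.

```python
def page_filenames(name, pages, extension="ps"):
    """List the expected output files for a plot of the given page count.

    Hypothetical helper. Assumes page numbers start at 1 and follow the
    literal pattern name_0n.ext; only 1 to 9 pages are supported.
    """
    if extension not in ("ps", "qpt"):
        raise ValueError("extension must be 'ps' or 'qpt'")
    if not 1 <= pages <= 9:
        raise ValueError("this sketch supports 1 to 9 pages only")
    return [f"{name}_0{n}.{extension}" for n in range(1, pages + 1)]
```

For example, page_filenames("dfr", 3) lists dfr_01.ps, dfr_02.ps and dfr_03.ps.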

Remove Sequence

Select sequences to close from the dialog box. Note that this will only close sequences and not MSFs.

¹Rost, Burkhard; Sander, Chris: "Prediction of protein structure at better than 70% accuracy" J. Mol. Biol., 232, 584-599 (1993).

¹Rost, Burkhard; Sander, Chris; Schneider, Reinhard: "PHD - an automatic mail server for protein secondary structure prediction" CABIOS, 10, 53-60 (1994).

¹Rost, Burkhard; Sander, Chris: "Combining evolutionary information and neural networks to predict protein secondary structure" Proteins, 19, 55-72 (1994).