F. Read Sequence File Formats


Overview

The file formats recognized are:

The first three formats are the easiest to use. When necessary, the names and extensions can be changed to something more appropriate. Sequences can be in either uppercase or lowercase. All formats recognize the one-letter code for the 20 standard amino acids plus:

B = ASX (asp or asn)
Z = GLX (glu or gln)
X = UNK (unknown)
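
The accepted alphabet can be sketched as a small check. This is a minimal illustration in Python, not the program's own code; the names STANDARD_RESIDUES, EXTRA_RESIDUES, and unrecognized_codes are hypothetical and chosen only for this example.

```python
# The 20 standard one-letter amino acid codes.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

# The three additional codes listed above, mapped to their three-letter names.
EXTRA_RESIDUES = {"B": "ASX", "Z": "GLX", "X": "UNK"}


def unrecognized_codes(sequence):
    """Return a sorted list of characters that are not recognized residue codes."""
    allowed = STANDARD_RESIDUES | set(EXTRA_RESIDUES)
    # Case is not significant, so compare in uppercase.
    return sorted({ch for ch in sequence.upper() if ch not in allowed})
```

An empty list means every character in the sequence is one of the 23 recognized codes.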

Many file formats contain additional comment lines, which are ignored when the files are read.


Pearson (FASTA) format (extension .aa)

The first line of the file begins with a ">"; the rest of that line is the title. The sequence is read until a "*" or the end of the file is encountered. Spaces and punctuation characters are ignored.
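
The reading rules above can be sketched in Python as follows. This is a hedged illustration rather than the program's actual reader: the function name read_pearson is hypothetical, and the sketch assumes that keeping only alphabetic characters is an acceptable way to ignore spaces and punctuation (digits are therefore dropped as well).

```python
def read_pearson(text):
    """Return (title, sequence) from Pearson (FASTA) style text."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError('first line must begin with ">"')
    # Everything after the ">" on the first line is the title.
    title = lines[0][1:].strip()
    residues = []
    for line in lines[1:]:
        for ch in line:
            if ch == "*":
                # A "*" ends the sequence; anything after it is not read.
                return title, "".join(residues)
            if ch.isalpha():
                # Assumption: only letters are residues; the rest is skipped.
                residues.append(ch)
    # No "*" found, so the sequence runs to the end of the file.
    return title, "".join(residues)
```

For example, with made-up input, read_pearson(">demo title\nMKV LA-\nGG*ignored\n") yields the title "demo title" and the sequence "MKVLAGG".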


GCG (extension .gcg)

The GCG file may contain an arbitrary number of comment lines at the start of the file. These are followed by a blank line and then a title line. The sequence is given with 50 residues per line, with each line beginning with the sequence number. These sequences can be obtained from the GCG package by entering the command:

> FETCH -DOCL= x

and then entering the appropriate code for the sequence.

The value for x represents the number of documentation lines. For example, to obtain a haemoglobin sequence, the following command and code are used:

> fetch -docl = 5


HAHU

This gives the sequence with five lines of documentation. There are two blank lines. One occurs after the documentation and the other before the sequence.

Using zero lines of documentation, the retrieved sequence would appear as:

The Read Sequence facility reads any GCG sequence, provided the records with the sequence information are as shown above. The placement of the integer field showing the sequence numbers is important.


NBRF-PIR

The first line contains a ">" at position 1 and a ";" at position 4, followed by the sequence ID. The second line is a title, which is followed by the sequence, which ends with a "*". Spaces are ignored. Several examples of this sequence format follow.
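
Separately from the format examples themselves, the rules in this section can be sketched in Python. This is a minimal illustration under stated assumptions, not the program's own reader: the function name read_pir is hypothetical, positions are taken to be 1-based as in the description, and "spaces" is read as any whitespace.

```python
def read_pir(text):
    """Return (sequence_id, title, sequence) from NBRF-PIR style text."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("expected an ID line and a title line")
    header = lines[0]
    # Positions are 1-based in the description: ">" at 1 and ";" at 4.
    if len(header) < 4 or header[0] != ">" or header[3] != ";":
        raise ValueError('first line needs ">" at position 1 and ";" at position 4')
    sequence_id = header[4:].strip()
    title = lines[1].strip()
    # The sequence is everything after the title line, up to the "*".
    body = "".join(lines[2:])
    end = body.find("*")
    if end == -1:
        raise ValueError('sequence must end with "*"')
    # Assumption: any whitespace counts as an ignorable space.
    sequence = "".join(body[:end].split())
    return sequence_id, title, sequence
```

With made-up input such as ">P1;DEMO\nmade-up title\nMKV LAG\nGW*\n", the function returns the ID "DEMO", the title "made-up title", and the sequence "MKVLAGGW".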

The following is an example of a GCG sequence converted to PIR using the utility TOPIR in the GCG package:


SWISSPROT (extension .sws)

The file begins with comment lines, each of which starts with a two-letter keyword. The sequence is preceded by a line beginning with the keyword SQ and is followed by a line beginning with //. Spaces are ignored.
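
The way the SQ and // lines bracket the sequence can be sketched in Python. This is a hedged illustration only: the function name read_swissprot_sequence is hypothetical, the keyworded comment lines before SQ are simply skipped, and "spaces" is read as any whitespace.

```python
def read_swissprot_sequence(text):
    """Return the sequence found between the SQ line and the // line."""
    residues = []
    in_sequence = False
    for line in text.splitlines():
        if not in_sequence:
            # Skip keyworded comment lines until the SQ line is reached.
            if line.startswith("SQ"):
                in_sequence = True
            continue
        if line.startswith("//"):
            # The // line closes the sequence.
            break
        # Assumption: any whitespace counts as an ignorable space.
        residues.append("".join(line.split()))
    if not in_sequence:
        raise ValueError("no line beginning with SQ was found")
    return "".join(residues)
```

With a made-up entry such as "ID   DEMO\nDE   made-up entry\nSQ   SEQUENCE\n     MKVLA GGWTR\n     AC\n//\n", the function returns "MKVLAGGWTRAC".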


© 2006 Accelrys Software Inc.