F. Read Sequence File Formats

Overview

The file formats recognized are:

FASTA (.aa extension)

GCG (.gcg extension)

NBRF-PIR (.pir extension)

Swissprot (.sws extension)
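The extension list above can be sketched as a simple lookup table. This is an illustrative sketch only: the function name guess_format and the case-insensitive matching are assumptions, not part of the Read Sequence facility itself.

```python
from pathlib import Path

# Extension-to-format table, copied from the list above.
FORMAT_BY_EXTENSION = {
    ".aa": "FASTA",
    ".gcg": "GCG",
    ".pir": "NBRF-PIR",
    ".sws": "Swissprot",
}


def guess_format(filename):
    """Return the format name for a file name, or None if it is not recognized.

    Matching is case-insensitive here; that is an assumption of this sketch.
    """
    return FORMAT_BY_EXTENSION.get(Path(filename).suffix.lower())
```

For example, guess_format("hahu.gcg") returns "GCG", while a file with an unlisted extension returns None.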

The first three formats are the easiest to use. When necessary, the names and extensions can be changed to something more appropriate. Sequences can be in either uppercase or lowercase. All formats recognize the one-letter code for the 20 standard amino acids plus:

B = ASX (asp or asn)
Z = GLX (glu or gln)
X = UNK (unknown)
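The accepted alphabet can be illustrated with a small check. This is a hedged sketch: the name normalize is hypothetical, and converting to uppercase and raising an error on other letters are assumptions about reasonable behaviour, not a description of what the program does.

```python
# The 20 standard one-letter amino acid codes.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

# Extra codes listed above, with their three-letter equivalents.
AMBIGUOUS = {"B": "ASX", "Z": "GLX", "X": "UNK"}

ALLOWED = STANDARD | set(AMBIGUOUS)


def normalize(sequence):
    """Uppercase a sequence and reject letters outside the accepted alphabet."""
    upper = sequence.upper()
    unexpected = sorted(set(upper) - ALLOWED)
    if unexpected:
        raise ValueError("unrecognized residue codes: " + ", ".join(unexpected))
    return upper
```

Because sequences may be given in either case, normalize("vlspadbzx") and normalize("VLSPADBZX") give the same result.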

Many file formats contain additional comment lines, which are ignored when the files are read.

Pearson (FASTA) format (extension .aa)

The first line of the file begins with a ">"; the rest of that line is the title. The sequence is then read until a "*" or the end of the file is encountered. Spaces and punctuation characters are ignored.

>P1REIA1 BENCE-*JONES IMMUNOGLOBUL: 1 A 107 A L=107 NRES= 214
D I Q M T Q S P S S L S A S V G D R V T I T C Q A S Q D I I K Y L N W Y Q Q T P G K A P K L L I Y E A S N L Q A G V P S R F S G S G S G T D Y T F T I S S L Q P E D I A T Y Y C Q Q Y Q S L P Y T F G Q G T K L Q I T *
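The reading rules for this format can be sketched as follows. This is a minimal illustration, not the program's own code: the name read_fasta is hypothetical, the text is assumed to hold a single record, and keeping only alphabetic characters (which also drops digits) and uppercasing the result are assumptions of the sketch.

```python
def read_fasta(text):
    """Return (title, sequence) from the text of a single-record .aa file."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError('the first line must begin with ">"')
    # Everything after the ">" is the title; a "*" here is not a terminator.
    title = lines[0][1:].strip()
    residues = []
    for line in lines[1:]:
        for ch in line:
            if ch == "*":
                # The terminator ends the sequence; anything after it is ignored.
                return title, "".join(residues)
            if ch.isalpha():
                # Spaces and punctuation are skipped (assumption: digits too).
                residues.append(ch.upper())
    # No "*" found: the sequence runs to the end of the file.
    return title, "".join(residues)
```

Note that the title line is never scanned for the terminator, so a "*" inside the title, as in the example above, does not end the sequence early.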

GCG (extension .gcg)

The GCG file may contain an arbitrary number of comment lines at the start of the file. These are followed by a blank line and then a title line. The sequence is given with 50 residues per line, and each line begins with the sequence number. These sequences can be obtained from the GCG package by entering the command:

> FETCH -DOCL= x

and then entering the appropriate code for the sequence.

The value for x represents the number of documentation lines. For example, to obtain a haemoglobin sequence, the following command and code are used:

> fetch -docl = 5

HAHU

This gives the sequence with five lines of documentation. There are two blank lines: one after the documentation and the other before the sequence.

P1;HAHU - Hemoglobin alpha chain - Human, chimpanzee, and pygmy chimpanzee
C;Species: Homo sapiens (man); Pan troglodytes (chimpanzee); Pan paniscus (pygmy chimpanzee, bonobo)
C;Accession: A02248
R;Michelson, A.M., and Orkin, S.H. . . .
HAHU Length: 141 January 26, 1993 11:37 Type: P Check: 9231 ..

	1	VLSPADKTNV	KAAWGKVGAH	AGEYGAEALE	RMFLSFPTTK	TYFPHFDLSH
	51	GSAQVKGHGK	KVADALTNAV	AHVDDMPNAL	SALSDLHAHK	LRVDPVNFKL
	101	LSHCLLVTLA	AHLPAEFTPA	VHASLDKFLA	SVSTVLTSKY	R

Using zero lines of documentation the retrieved sequence would appear as:

HAHU Length: 141 January 25, 1993 15:21 Type: P Check: 9231 ..

	1	VLSPADKTNV	KAAWGKVGAH	AGEYGAEALE	RMFLSFPTTK	TYFPHFDLSH
	51	GSAQVKGHGK	KVADALTNAV	AHVDDMPNAL	SALSDLHAHK	LRVDPVNFKL
	101	LSHCLLVTLA	AHLPAEFTPA	VHASLDKFLA	SVSTVLTSKY	R

The Read Sequence facility reads any GCG sequence, provided the records containing the sequence information are laid out as shown above. The placement of the integer field showing the sequence numbers is important.
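The layout shown in the examples can be parsed along the following lines. This is a hedged sketch with a hypothetical function name, read_gcg. It assumes the title line is the first line ending in "..", as in the examples, and that every sequence line starts with an integer field; it does not check the exact column of that field, which the real reader may require.

```python
def read_gcg(text):
    """Return (title, sequence) from the text of a .gcg file."""
    lines = text.splitlines()
    # Assumption: the title line is the first line that ends with "..".
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError('no title line ending in ".." was found')
    title = lines[index].rstrip()[:-2].strip()

    residues = []
    for line in lines[index + 1:]:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if not fields[0].isdigit():
            raise ValueError("sequence line does not start with a number: " + line)
        # Everything after the leading sequence number is residue data.
        residues.extend(fields[1:])
    return title, "".join(residues).upper()
```

Comment lines before the title line are skipped without inspection, so the same function handles files fetched with any number of documentation lines.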

NBRF-PIR (extension .pir)

The first line contains a > at position 1 and a ; at position 4, followed by the sequence ID. The second line is a title, which is followed by the sequence, ending with a *. Spaces are ignored. Several examples of this format follow.

>P1;HAHU
Hemoglobin alpha chain - Human, chimpanzee, and pygmy chimpanzee
V L S P A D K T N V K A A W G K V G A H A G E Y G A E A L E R M F L S F P T T K T Y F P H F D L S H G S A Q V K G H G K K V A D A L T N A V A H V D D M P N A L S A L S D L H A H K L R V D P V N F K L L S H C L L V T L A A H L P A E F T P A V H A S L D K F L A S V S T V L T S K Y R *

>P1;CHOA$STRSQ
NRES 546 (T= 74 ) DE CHOLESTEROL OXIDASE PRECURSOR (EC 1.1.3.6) (C

MTAQQHLSRR	RMLGMAAFGA	AALAGGTTIA	APRAAAAAKS	AADNGGYVPA
VVIGTGYGAA	VSALRLGEAG	VQTLMLEMGQ	LWNQPGPDGN	IFCGMLNPDK
RSSWFKNRTE	APLGSFLWLD	VVNRNIDPYA	GVLDRVNYDQ	MSVYVGRGVG
GGSLVNGGMA	VEPKRSYFEE	ILPRVDSSEM	YDRYFPRANS	MLRVNHIDTK
WFEDTEWYKF	ARVSREQAGK	AGLGTVFVPN	VYDFGYMQRE	AAGEVPKSAL
ATEVIYGNNH	GKQSLDKTYL	AAALGTGKVT	IQTLHQVKTI	RQTKDGGYAL
TVEQKDTDGK	LLATKEISCR	YLFLGAGSLG	STELLVRARD	TGTLPNLNSE
VGAWGPNGN	IMTARANHMW	NPTGAHQSSI	PALGIDAWDN	SDSSVFAEIA
PMPAGLETWV	SLYLAITKNP	QRGTFVYDAA	TDRAKLNWTR	DQNAPAVNAA
KALFDRINKA	NGTIYRYDLF	GTQLKAFADD	FCYHPLGGCV	LGKATDDYGR
VAGYKNLYVT	DGSLIPGSVG	VNPFVTITAL	AERNVERIIK	QDVTAS*

The following is an example of a GCG sequence converted to PIR format using the TOPIR utility in the GCG package:

>P1;HAHU
hahu.gcg => HAHU
VLSPADKTNV	KAAWGKVGAH	AGEYGAEALE	RMFLSFPTTK	TYFPHFDLSH
GSAQVKGHGK	KVADALTNAV	AHVDDMPNAL	SALSDLHAHK	LRVDPVNFKL
LSHCLLVTLA	AHLPAEFTPA	VHASLDKFLA	SVSTVLTSKY	R*
C;P1;HAHU - Hemoglobin alpha chain - Human and chimpanzees
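The NBRF-PIR layout described in this section (an ID line, a title line, then the sequence ending with a *) can be read with a sketch like the one below. The name read_pir is hypothetical, the text is assumed to hold a single record, and anything after the terminating * (such as trailing C; comment lines) is simply ignored, which is an assumption of this sketch.

```python
def read_pir(text):
    """Return (sequence_id, title, sequence) from the text of a .pir file."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("expected an ID line followed by a title line")
    header = lines[0]
    # ">" must be at position 1 and ";" at position 4 (1-based positions).
    if len(header) < 4 or header[0] != ">" or header[3] != ";":
        raise ValueError('ID line must have ">" at position 1 and ";" at position 4')
    sequence_id = header[4:].strip()
    title = lines[1].strip()

    residues = []
    for line in lines[2:]:
        body, star, _ = line.partition("*")
        # Spaces (and tabs) inside the sequence are ignored.
        residues.append("".join(body.split()))
        if star:
            break  # the "*" terminates the sequence
    else:
        raise ValueError('the sequence is not terminated by "*"')
    return sequence_id, title, "".join(residues).upper()
```

Applied to the TOPIR output above, this would return the ID HAHU, the title "hahu.gcg => HAHU", and the residues up to the terminating *.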

Swissprot (extension .sws)

The file begins with comment lines, each of which starts with a two-letter keyword. The sequence is preceded by a line beginning with the keyword SQ and is followed by a line beginning with //. Spaces are ignored.

ID 104K$THEPA STANDARD; PRT; 924 AA. 
AC P15711; 
DT 01-APR-1990 (REL. 14, CREATED) 
DT 01-APR-1990 (REL. 14, LAST SEQUENCE UPDATE) 
DT 01-AUG-1990 (REL. 15, LAST ANNOTATION UPDATE) 
DE 104 KD MICRONEME-RHOPTRY ANTIGEN. 
OS THEILERIA PARVA. 
OC EUKARYOTA; PROTOZOA; APICOMPLEXA; SPOROZOA; COCCIDIA; PIROPLASMIDA. 
RN [1] 
RP SEQUENCE FROM N.A. 
RC STRAIN=MUGUGA;
RC MEDLINE=90158697; 
RA IAMS K.P., YOUNG J.R., NENE V., DESAI J., WEBSTER P., 
RA OLE-MOIYOI O.K., MUSOKE A.J.;
RL MOL. BIOCHEM. PARASITOL. 39:47-60(1990). 
CC -!- DEVELOPMENTAL STAGE: SPOROZOIT ANTIGEN. 
CC -!- SUBCELLULAR LOCATION: IN MICRONEME/RHOPTRY COMPLEXES. 
DR EMBL; M29954; TP104MRA. 
KW ANTIGEN; PROLINE-RICH; REPEAT. 
FT DOMAIN 1 19 HYDROPHOBIC STRETCH.
FT DOMAIN 905 924 HYDROPHOBIC STRETCH. 
SQ SEQUENCE 924 AA; 103625 MW; 4746107 CN;
MKFLILLFNI	LCLFPVLAAD	NHGVGPQGAS	GVDPITFDIN	SNQTGPAFLT
AVEMAGVKYL
QVQHGSNVNI	HRLVEGNVVI	WENASTPLYT	GAIVTNNDGP	YMAYVEVLGD
PNLQFFIKSG
DAWVTLSEHE	YLAKLQEIRQ	AVHIESVFSL	NMAFQLENNK	YEVETHAKNG
ANMVTFIPRN
GHICKMVYHK	NVRIYKATGN	DTVTSVVGFF	RGLRLLLINV	FSIDDNGMMS
NRYFQHVDDK
YVPISQKNYE	TGIVKLKDYK	HAYHPVDLDI	KDIDYTMFHL	ADATYHEPCF
KIIPNTGFCI
TKLFDGDQVL	YESFNPLIHC	INEVHIYDRN	NGSIICLHLN	YSPPSYKAYL
VLKDTGWEAT
THPLLEEKIE	ELQDQRACEL	DVNFISDKDL	YVAALTNADL	NYTMVTPRPH
RDVIRVSDGS
EVLWYYEGLD	NFLVCAWIYV	SDGVASLVHL	RIKDRIPANN	DIYVLKGDLY
WTRITKIQFT
QEIKRLVKKS	KKKLAPITEE	DSDKHDEPPE	GPGASGLPPK	APGDKEGSEG
HKGPSKGSDS
SKEGKKPGSG	KKPGPAREHK	PSKIPTLSKK	PSGPKDPKHP	RDPKEPRKSK
SPRTASPTRR
PSPKLPQLSK	LPKSTSPRSP	PPPTRPSSPE	RPEGTKIIKT	SKPPSPKPPF
DPSFKEKFYD
DYSKAASRSK	ETKTTVVLDE	SFESILKETL	PETPGTPFTT	PRPVPPKRPR
PESPFEPPK
DPDSPSTSPS	EFFTPPESKR	TRFHETPADT	PLPDVTAELF	KEPDVTAETK
SPDEAMKRPR
SPSEYEDTSP	GDYPSLPMKR	HRLERLRLTT	TEMETDPGRM	AKDASGKPVK
LKRSKSFDDL
TTVELAPEPK	ASRIVVDDEG	TEADDEETHP	PEERQKTEVR	RRRPPKKPSK
SPRPSKPKKP
KKPDSAYIPS	ILAILVVSLI	VGIL
 //
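The SQ and // markers make this format straightforward to read. The following is a minimal sketch under stated assumptions: the name read_swissprot is hypothetical, only the sequence is extracted (the keyword lines are skipped), and leading whitespace before // is tolerated, which goes slightly beyond the description above.

```python
def read_swissprot(text):
    """Return the sequence found between the SQ line and the // line."""
    residues = []
    in_sequence = False
    for line in text.splitlines():
        if in_sequence:
            # Assumption: leading whitespace before "//" is tolerated.
            if line.lstrip().startswith("//"):
                return "".join(residues).upper()
            # Spaces and tabs between residue blocks are ignored.
            residues.append("".join(line.split()))
        elif line.startswith("SQ"):
            # The sequence starts on the line after the SQ keyword line.
            in_sequence = True
    raise ValueError("missing SQ line or terminating // line")
```

All lines before the SQ line, including the ID, DE and FT records, are treated as comments, in keeping with the statement that comment lines are ignored when files are read.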