The file formats recognized are:
The first three formats are the easiest to use. When necessary, the names and extensions can be changed to something more appropriate. Sequences can be in either uppercase or lowercase. All formats recognize the one-letter code for the 20 standard amino acids plus:
B = ASX (asp or asn)
Z = GLX (glu or gln)
X = UNK (unknown)
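The accepted alphabet described above can be sketched as a small check. This is a minimal illustration, not the program's own code; the names STANDARD_RESIDUES, AMBIGUITY_CODES and invalid_residues are hypothetical.

```python
# One-letter codes for the 20 standard amino acids, plus the three
# extra codes listed above (B, Z and X).
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
AMBIGUITY_CODES = {"B": "ASX", "Z": "GLX", "X": "UNK"}
ALLOWED = STANDARD_RESIDUES | set(AMBIGUITY_CODES)


def invalid_residues(sequence):
    """Return the sorted, distinct characters that are not recognized codes."""
    # Sequences may be in uppercase or lowercase, so compare in uppercase.
    return sorted(set(sequence.upper()) - ALLOWED)
```

An empty list means every character in the string is one of the 23 recognized codes. Spaces and punctuation should be stripped before calling it, since this sketch treats them as unrecognized.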
Many file formats contain additional comment lines, which are ignored when the files are read.
The first line of the file begins with a ">" and the rest of that line is the title. The sequence is read until a "*" or the end of the file is encountered. Spaces and punctuation characters are ignored.
>P1REIA1 BENCE-*JONES IMMUNOGLOBUL: 1 A 107 A L=107 NRES= 214
D I Q M T Q S P S S L S A S V G D R V T I T C Q A S Q D I I K Y L N W Y Q Q T P G K A P K L L I Y E A S N L Q A G V P S R F S G S G S G T D Y T F T I S S L Q P E D I A T Y Y C Q Q Y Q S L P Y T F G Q G T K L Q I T *
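The reading rule for this format can be sketched in a few lines. This is an illustration under stated assumptions rather than the program's actual reader: the function name read_title_star_record is hypothetical, and converting residues to uppercase is a choice made here for convenience.

```python
def read_title_star_record(text):
    """Return (title, sequence) from a record whose first line starts with '>'."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("first line must begin with '>'")
    # Everything after the '>' on the first line is the title. It is taken
    # whole, so a '*' inside the title does not end the sequence.
    title = lines[0][1:].strip()
    residues = []
    for line in lines[1:]:
        for ch in line:
            if ch == "*":
                # A '*' terminates the sequence.
                return title, "".join(residues)
            if ch.isalpha():
                # Spaces and punctuation are skipped; letters are kept.
                residues.append(ch.upper())
    # No '*' found: the sequence runs to the end of the file.
    return title, "".join(residues)
```

Because the title line is handled separately, a title such as the one in the example above, which itself contains a "*", does not cut the record short.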
The GCG file may contain an arbitrary number of comment lines at the start of the file. These are followed by a blank line and then a title line. The sequence is given with 50 residues per line, and each line begins with the sequence number. These sequences can be obtained from the GCG package by entering the command:
> FETCH -DOCL= x
and then entering the appropriate code for the sequence.
The value of x is the number of documentation lines. For example, to obtain a haemoglobin sequence, the following command and code are used:
> fetch -docl = 5
This gives the sequence with five lines of documentation. There are two blank lines. One occurs after the documentation and the other before the sequence.
P1;HAHU - Hemoglobin alpha chain - Human, chimpanzee, and pygmy chimpanzee
C;Species: Homo sapiens (man); Pan troglodytes (chimpanzee); Pan paniscus (pygmy chimpanzee, bonobo)
C;Accession: A02248
R;Michelson, A.M., and Orkin, S.H. . . .
HAHU Length: 141 January 26, 1993 11:37 Type: P Check: 9231 ..
1 VLSPADKTNV KAAWGKVGAH AGEYGAEALE RMFLSFPTTK TYFPHFDLSH
51 GSAQVKGHGK KVADALTNAV AHVDDMPNAL SALSDLHAHK LRVDPVNFKL
101 LSHCLLVTLA AHLPAEFTPA VHASLDKFLA SVSTVLTSKY R
With zero lines of documentation, the retrieved sequence would appear as:
HAHU Length: 141 January 25, 1993 15:21 Type: P Check: 9231 ..
1 VLSPADKTNV KAAWGKVGAH AGEYGAEALE RMFLSFPTTK TYFPHFDLSH
51 GSAQVKGHGK KVADALTNAV AHVDDMPNAL SALSDLHAHK LRVDPVNFKL
101 LSHCLLVTLA AHLPAEFTPA VHASLDKFLA SVSTVLTSKY R
The Read Sequence facility reads any GCG sequence, provided the records with the sequence information are as shown above. The placement of the integer field showing the sequence numbers is important.
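A reader for the layout shown above might look like the following sketch. It is not the Read Sequence code. It assumes the title line is the first line ending in "..", as in the examples, and that every sequence line starts with an integer field; the function name read_gcg_sequence is hypothetical.

```python
def read_gcg_sequence(text):
    """Return (title, sequence) from text laid out like the GCG examples."""
    lines = text.splitlines()
    # Assumption: the title line is the first line that ends in "..".
    # Documentation lines before it are skipped.
    for index, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            break
    else:
        raise ValueError("no title line ending in '..' was found")
    title = line.rstrip()[:-2].strip()

    residues = []
    for seq_line in lines[index + 1:]:
        fields = seq_line.split()
        if not fields:
            continue  # blank line between the title and the sequence
        # The first field must be the sequence number of the line.
        if not fields[0].isdigit():
            raise ValueError("sequence line lacks a leading number: " + seq_line)
        residues.extend(fields[1:])
    return title, "".join(residues).upper()
```

The check on the leading integer reflects the remark above that the placement of the sequence-number field matters. A documentation line that happens to end in ".." would be mistaken for the title line by this sketch.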
The first line contains a > at position 1 and a ; at position 4, followed by the sequence ID. The second line is a title, which is followed by the sequence, ending with a *. Spaces are ignored. Several examples of this sequence format follow.
>P1;HAHU
Hemoglobin alpha chain - Human, chimpanzee, and pygmy chimpanzee
V L S P A D K T N V K A A W G K V G A H A G E Y G A E A L E R M F L S F P T T K T Y F P H F D L S H G S A Q V K G H G K K V A D A L T N A V A H V D D M P N A L S A L S D L H A H K L R V D P V N F K L L S H C L L V T L A A H L P A E F T P A V H A S L D K F L A S V S T V L T S K Y R *
>P1;CHOA$STRSQ
NRES 546 (T= 74 ) DE CHOLESTEROL OXIDASE PRECURSOR (EC 1.1.3.6) (C
MTAQQHLSRR RMLGMAAFGA AALAGGTTIA APRAAAAAKS AADNGGYVPA
VVIGTGYGAA VSALRLGEAG VQTLMLEMGQ LWNQPGPDGN IFCGMLNPDK
RSSWFKNRTE APLGSFLWLD VVNRNIDPYA GVLDRVNYDQ MSVYVGRGVG
GGSLVNGGMA VEPKRSYFEE ILPRVDSSEM YDRYFPRANS MLRVNHIDTK
WFEDTEWYKF ARVSREQAGK AGLGTVFVPN VYDFGYMQRE AAGEVPKSAL
ATEVIYGNNH GKQSLDKTYL AAALGTGKVT IQTLHQVKTI RQTKDGGYAL
TVEQKDTDGK LLATKEISCR YLFLGAGSLG STELLVRARD TGTLPNLNSE
VGAWGPNGN IMTARANHMW NPTGAHQSSI PALGIDAWDN SDSSVFAEIA
PMPAGLETWV SLYLAITKNP QRGTFVYDAA TDRAKLNWTR DQNAPAVNAA
KALFDRINKA NGTIYRYDLF GTQLKAFADD FCYHPLGGCV LGKATDDYGR
VAGYKNLYVT DGSLIPGSVG VNPFVTITAL AERNVERIIK QDVTAS*
The following is an example of a GCG sequence converted to PIR using the utility TOPIR in the GCG package:
>P1;HAHU
hahu.gcg => HAHU
VLSPADKTNV KAAWGKVGAH AGEYGAEALE RMFLSFPTTK TYFPHFDLSH
GSAQVKGHGK KVADALTNAV AHVDDMPNAL SALSDLHAHK LRVDPVNFKL
LSHCLLVTLA AHLPAEFTPA VHASLDKFLA SVSTVLTSKY R*
C;P1;HAHU - Hemoglobin alpha chain - Human and chimpanzees
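The PIR layout described above (ID line, title line, sequence ending in a *) can be read with a sketch like the one below. It assumes the ID and the title sit on separate lines, as the description states, and it ignores any lines between a * and the next > (for example lines starting with C;). The function name read_pir_records is hypothetical.

```python
def read_pir_records(text):
    """Return a list of (sequence_id, title, sequence) tuples."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        # Header line: '>' at position 1 and ';' at position 4, then the ID.
        if not (line.startswith(">") and len(line) > 4 and line[3] == ";"):
            continue  # comment lines or anything else outside a record
        seq_id = line[4:].strip()
        # The line after the header is the title.
        title = next(lines, "").strip()
        residues = []
        for seq_line in lines:
            # Keep only what comes before a '*'; the '*' ends the sequence.
            head, star, _ = seq_line.partition("*")
            residues.extend(ch for ch in head if ch.isalpha())
            if star:
                break
        records.append((seq_id, title, "".join(residues).upper()))
    return records
```

Both loops draw from the same iterator, so after a record ends the outer loop resumes at the line following the *, which lets one file hold several records.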
The file begins with comment lines, each of which starts with a two-letter keyword. The sequence is preceded by a line beginning with the keyword SQ and is followed by a line beginning with //. Spaces are ignored.
ID 104K$THEPA STANDARD; PRT; 924 AA.
AC P15711;
DT 01-APR-1990 (REL. 14, CREATED)
DT 01-APR-1990 (REL. 14, LAST SEQUENCE UPDATE)
DT 01-AUG-1990 (REL. 15, LAST ANNOTATION UPDATE)
DE 104 KD MICRONEME-RHOPTRY ANTIGEN.
OS THEILERIA PARVA.
OC EUKARYOTA; PROTOZOA; APICOMPLEXA; SPOROZOA; COCCIDIA; PIROPLASMIDA.
RN [1]
RP SEQUENCE FROM N.A.
RC STRAIN=MUGUGA;
RC MEDLINE=90158697;
RA IAMS K.P., YOUNG J.R., NENE V., DESAI J., WEBSTER P.,
RA OLE-MOIYOI O.K., MUSOKE A.J.;
RL MOL. BIOCHEM. PARASITOL. 39:47-60(1990).
CC -!- DEVELOPMENTAL STAGE: SPOROZOIT ANTIGEN.
CC -!- SUBCELLULAR LOCATION: IN MICRONEME/RHOPTRY COMPLEXES.
DR EMBL; M29954; TP104MRA.
KW ANTIGEN; PROLINE-RICH; REPEAT.
FT DOMAIN 1 19 HYDROPHOBIC STRETCH.
FT DOMAIN 905 924 HYDROPHOBIC STRETCH.
SQ SEQUENCE 924 AA; 103625 MW; 4746107 CN;
MKFLILLFNI LCLFPVLAAD NHGVGPQGAS GVDPITFDIN SNQTGPAFLT AVEMAGVKYL
QVQHGSNVNI HRLVEGNVVI WENASTPLYT GAIVTNNDGP YMAYVEVLGD PNLQFFIKSG
DAWVTLSEHE YLAKLQEIRQ AVHIESVFSL NMAFQLENNK YEVETHAKNG ANMVTFIPRN
GHICKMVYHK NVRIYKATGN DTVTSVVGFF RGLRLLLINV FSIDDNGMMS NRYFQHVDDK
YVPISQKNYE TGIVKLKDYK HAYHPVDLDI KDIDYTMFHL ADATYHEPCF KIIPNTGFCI
TKLFDGDQVL YESFNPLIHC INEVHIYDRN NGSIICLHLN YSPPSYKAYL VLKDTGWEAT
THPLLEEKIE ELQDQRACEL DVNFISDKDL YVAALTNADL NYTMVTPRPH RDVIRVSDGS
EVLWYYEGLD NFLVCAWIYV SDGVASLVHL RIKDRIPANN DIYVLKGDLY WTRITKIQFT
QEIKRLVKKS KKKLAPITEE DSDKHDEPPE GPGASGLPPK APGDKEGSEG HKGPSKGSDS
SKEGKKPGSG KKPGPAREHK PSKIPTLSKK PSGPKDPKHP RDPKEPRKSK SPRTASPTRR
PSPKLPQLSK LPKSTSPRSP PPPTRPSSPE RPEGTKIIKT SKPPSPKPPF DPSFKEKFYD
DYSKAASRSK ETKTTVVLDE SFESILKETL PETPGTPFTT PRPVPPKRPR PESPFEPPK
DPDSPSTSPS EFFTPPESKR TRFHETPADT PLPDVTAELF KEPDVTAETK SPDEAMKRPR
SPSEYEDTSP GDYPSLPMKR HRLERLRLTT TEMETDPGRM AKDASGKPVK LKRSKSFDDL
TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP
KKPDSAYIPS ILAILVVSLI VGIL
//
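The keyword-based layout above can be read with a sketch along these lines. It is an illustration only: the function name read_keyword_record is hypothetical, and collecting the comment lines into a dictionary keyed by the two-letter keyword is a convenience added here, not something the description requires.

```python
def read_keyword_record(text):
    """Return (fields, sequence) for one record ending in a '//' line."""
    fields = {}      # two-letter keyword -> list of line contents
    residues = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end of the record
        if in_sequence:
            # Spaces are ignored; only letters are kept.
            residues.extend(ch for ch in line if ch.isalpha())
        elif line.startswith("SQ"):
            # The sequence starts on the line after the SQ line.
            in_sequence = True
        elif len(line) >= 2:
            fields.setdefault(line[:2], []).append(line[2:].strip())
    return fields, "".join(residues).upper()
```

Because spaces and line breaks inside the sequence are skipped, the result is the same whether the residues are written 50 or 60 to a line.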