This appendix describes the format of several files that are used with QUANTA, including molecular structure files (.msf), residue topology files (.rtf), display parameter files, external files, template files, template files for hydrogen atom addition, dummy atom files (.dum), brick map files (.mbk), QUANTA plot files (.qpt), ChemNote data files, and atom type files for ChemNote and the Molecular Editor.
Molecular structure files (MSFs) contain data and information about a molecule. An MSF contains three levels of information:
In addition, extra information such as solvent accessibility, thermal mobility, and electrostatic potential can be included in the MSF. Extra information is incorporated as a number associated with each atom and can be retrieved through a label. This label enables the selection and coloring of a molecule based on one of these parameters. The extra information can also hold pointers to surface files, symmetry information, or vectors for each atom. In this way, virtually any information about a molecule can be held in the MSF.
The MSF is a sequential binary file. Word length is assumed to be at least 4 bytes in QUANTA unless specified otherwise. The records in the file are as follows.
1. nseg, ngroup, natom, version, header
This record contains three integers, the dataset version (character*10), and a header (character*200) that contains various flags. The integers represent the number of segments in the file, the number of residues, and the number of atoms.
For files created in QUANTA98 and beyond, the version number is QUANTAR98. QUANTA 2006 reads earlier versions (e.g., QUANTAR3.3) and automatically updates them to QUANTA 2006 format. The utility $HYD_UTL/.msfi2i4 allows conversion between the old and new versions, if that is required.
2. Title records (character*80) containing the title for the file. The last line is END.
3. Segment Data consists of three segment records each of length nseg. They are:
a. Segment names (character * 4)
b. Residue pointers (integer*4) that point to the first residue in the segment
c. Number of residues in the segment (integer*4)
4. Residue Data contains nine records each of length nresidues. They are:
a. Residue identifiers (character*6)
b. Residue names (character*4)
c. Atom pointers (integer*4) point to the first atom in the residue in the atom list
d. Number of atoms in the residue (integer*4)
e. Segment numbers (integer*4)
f. X Coordinate of the center of the residue (real)
g. Y Coordinate of the center of the residue (real)
h. Z Coordinate of the center of the residue (real)
i. Radius of the residue from the center (real)
5. Atom Data consists of seven records each of length natom:
e. Residue numbers (integer*4)
The residue number is a pointer back into the residue data so that the appropriate residue information is available for each atom. The number for each atom type points to parameters held in the parameter file.
The QUANTA MSF format allows the storage of extra per-atom information. An MSF created by QUANTA will by default contain a set of extra information that is loaded into the fourth parameter array. This set is the atomic temperature factors labeled BVALUE.
Eight types of data can be stored: real, integer, integer*4, real vector, symmetry, connectivity, bond orders, and atom constraints. Each extra piece of information requires two records (four in the case of a real vector).
a. The first record is a header record: label type nitems file
b. The next record(s) contain the nitems pieces of data.
c. For REAL, INTG, ASTR, and INT2, there is just one record. (Atom constraints (ASTR) can only be generated by the program.) For REA3 there are three records.
d. Connectivity information and bond order information can only be generated by the program:
Connectivity header record:
Label    CONNECT
Type     BOND
Nitems   natom
A record containing the connectivity information would be: (nbc(i), (ibc(j,i), j = 1, nbc(i)), i = 1, nitems)
nbc is an integer*4 array containing the number of bonds to each atom.
ibc is an integer*4 array containing the atom numbers of the connected atoms.
A record, containing a list of the bond types for each bond would be: (ibf (i), (ib(j,i), j = 1,2), i = 1, nitems)
e. There is a special format for symmetry information. The records are character*80 records, which begin with a five-character special label. These are CELL, CRYS, SYMM, SYMMC, and END. The CELL line contains a, b, c (angstroms), alpha, beta, gamma (degrees), and the space group name. The CRYS line contains the cell type (TRICLINIC, MONOCLINIC).
The SYMM line contains symmetry operations as defined in the International tables (this should only include the unique symmetry operations); lattice translations are defined by the lattice type.
SYMMC lines contain matrices defining non-crystallographic symmetry.
END defines the end of the symmetry information.
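The connectivity record in item d can be pictured as a flat run of integers: for each atom, a bond count followed by that many atom numbers. The Python sketch below unpacks such a run into per-atom neighbor lists. It assumes the integers have already been read from the binary record; the function name and the in-memory list are illustrative and are not part of QUANTA.

```python
def unpack_connectivity(values, nitems):
    """Split a flat integer list laid out as
    nbc(1), ibc(1..nbc(1),1), nbc(2), ibc(1..nbc(2),2), ...
    into one neighbor list per atom."""
    neighbors = []
    pos = 0
    for _ in range(nitems):
        nbc = values[pos]          # number of bonds to this atom
        pos += 1
        # the next nbc entries are the (1-based) numbers of the bonded atoms
        neighbors.append(list(values[pos:pos + nbc]))
        pos += nbc
    if pos != len(values):
        raise ValueError("record length does not match nitems")
    return neighbors


# Hypothetical three-atom chain 1-2-3
flat = [1, 2, 2, 1, 3, 1, 2]
print(unpack_connectivity(flat, 3))
```

Because each bond is listed under both of its atoms, the neighbor lists are symmetric for a well-formed record.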
The display parameter file (param.par) contains the names, bonding, van der Waals radii, and energy parameters of the atoms recognized by QUANTA. This file is read every time QUANTA is started up. This data structure and the QUANTA dictionaries assign graphical display parameters to atoms. For molecules read into QUANTA from an external coordinate file, atom types are assigned solely on the basis of the atom name. This assignment may not be the same as the types determined in ChemNote or the Molecular Editor applications. The parameters in this file are consistent with those used by CHARMm. However, CHARMm accesses a different data structure (PARM.PRM) to obtain parameters for calculation. If a display parameter file is not specified, then all atoms are set to a predetermined default value.
See the QUANTA Parameter Handbook for a complete listing of the atom types used in QUANTA.
This format example is part of $HYD_LIB/param.par, the parameter file for proteins.
no bndrad vdwrad plurad global emin rmin patom hbond atype atmass
1 0.4000 0.9000 0.1000 F -0.0498 0.800 0.044 D H 1.00800
2 0.4000 0.9000 0.1000 F -0.0498 0.600 0.044 D HC 1.00800
3 0.4000 1.0000 0.1000 F -0.0420 1.330 0.1 N HA 1.00800
4 0.4000 0.9000 0.1000 F -0.0498 0.920 0.044 D HT 1.00800
The following values are applied to the atom type number listed in the first column:
If you do not enter a parameter for a particular atom type, the atom type uses the default of:
no bndrad vdwrad plurad global emin rmin patom hbond atype atmass
499 0.9 1.4 0.2 F 0.0 0.0 0.0 N DEFA
Atoms with Mxx are metals; atoms with Xxx are halogens.
If you want to extend the parameter file to include your own atom types, it is recommended that you use atom numbers between 300 and 400.
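As an illustration of the column layout shown above, the following Python sketch reads one row of the parameter table into a small record. It assumes the columns are separated by whitespace (the real file may use fixed columns) and that the atmass column may be absent, as in the default row; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AtomParam:
    no: int
    bndrad: float
    vdwrad: float
    plurad: float
    global_flag: str
    emin: float
    rmin: float
    patom: float
    hbond: str
    atype: str
    atmass: Optional[float]   # missing in the default row


def parse_param_line(line):
    """Parse one whitespace-separated row of the parameter table."""
    f = line.split()
    if len(f) not in (10, 11):
        raise ValueError("expected 10 or 11 columns, got %d" % len(f))
    return AtomParam(
        int(f[0]), float(f[1]), float(f[2]), float(f[3]), f[4],
        float(f[5]), float(f[6]), float(f[7]), f[8], f[9],
        float(f[10]) if len(f) == 11 else None,
    )


row = parse_param_line("2 0.4000 0.9000 0.1000 F -0.0498 0.600 0.044 D HC 1.00800")
print(row.atype, row.vdwrad, row.atmass)
```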
A variety of external file formats are used in QUANTA for input and output of atomic data. This section defines what these formats mean to QUANTA.
x y z - orthogonal angstrom coordinates of each atom
resid - residue identifier (residue number)
resnam - residue name, such as TRP GLU
atnam - atom name, such as CE2 CA N
bvalue - bvalue or some fourth parameter
QUANTA reads a standard PDB data file. Atomic coordinates are taken from both ATOM ("standard" groups) and HETATM ("non-standard" groups) records.
Chain identifiers (if present) are used to define segments. If none are present, a segment name is created. HETATMs are placed in separate segments.
The second character of two-symbol element names is lowercase on input. Lowercase characters are specified in QUANTA by preceding the character with the escape character, usually, but this can be altered using SET ESCAPE. The atom names are all left-justified. This process is reversed on output, to maintain the correct PDB convention.
The CHARMm/CNX/X-PLOR PDB format must be used if a file is to be produced or read by CHARMm, CNX, or X-PLOR. This format differs from the standard Brookhaven PDB format in the following respects:
1. The segment name is a four-character string in columns 73 through 76.
2. On export, any * characters in nucleic acid names are converted to ` (the * is the CHARMm/CNX/X-PLOR wildcard character).
3. On export, amino/nucleic acids are not reordered to the Brookhaven conventional order.
4. Atom names are read or written straight into the atom name field (the Brookhaven convention is right-justified within the first two characters of the atom name field).
5. CNX and X-PLOR expect the residue ID to be left-justified.
QUANTA recognizes both the standard CHARMm and Brunger CHARMm formats. Output is only in standard format. The CHARMm format (.crd) is as follows:
1. TITLE lines (character*80) begin with a *. The last title line is a * followed by at least seven blanks. The next line gives natom in I5 format.
2. ATOM lines: atom# resid1 resnam atnam X Y Z segid resid2 bvalue
3. format: I5,I5,1x,a4,1x,a4,3f10.5,1x,a4,1x,a4,f10.5
You are given the option of using resid1 or resid2 as the residue identifier.
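The ATOM line format above maps directly onto fixed column slices. The Python sketch below parses one such line under that format; it is an illustration only, the function name is hypothetical, and treating a blank bvalue field as 0.0 is an assumption of the sketch.

```python
def parse_crd_atom(line):
    """Parse one ATOM line laid out as
    I5,I5,1x,a4,1x,a4,3f10.5,1x,a4,1x,a4,f10.5 (70 columns)."""
    line = line.rstrip("\n").ljust(70)
    return {
        "atom_no": int(line[0:5]),
        "resid1": int(line[5:10]),
        "resnam": line[11:15].strip(),
        "atnam": line[16:20].strip(),
        "x": float(line[20:30]),
        "y": float(line[30:40]),
        "z": float(line[40:50]),
        "segid": line[51:55].strip(),
        "resid2": line[56:60].strip(),
        # A blank bvalue field is read as 0.0 (assumption of this sketch).
        "bvalue": float(line[60:70].strip() or "0"),
    }


# Hypothetical record built with the same field widths
sample = "%5d%5d %-4s %-4s%10.5f%10.5f%10.5f %-4s %-4s%10.5f" % (
    1, 1, "ALA", "N", 1.5, -2.25, 3.0, "PROT", "1", 12.5)
atom = parse_crd_atom(sample)
print(atom["atnam"], atom["x"], atom["segid"], atom["resid2"])
```

Either resid1 or resid2 can then be chosen as the residue identifier, matching the option described above.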
resnam resid atnam X Y Z bvalue 2x,a4,1x,a4,a4,4f10.5
X Y Z bvalue resid resnam atnam 4f10.5,6x,a4,15x,a3,7x,a4
The CHARMm binary format (.dcd) is used to hold many sets of coordinates, including the results of a dynamics run at various time steps. The format is as follows:
character*4 HDR, integer icntrl(20)
real*4 X(NATOM), Y(NATOM), Z(NATOM)
real*8 XTLABC(6)
logical QCRYS
HDR - not used in QUANTA
ICNTRL - contains information about the datasets held in file
QCRYS=ICNTRL(11).EQ.1
(1) - number of datasets in the file. Not necessarily correct.
(2) - time of the first dataset - usually in femtoseconds
(3) - time step between datasets
(9) - number of fixed atoms (NFIXED)
(11) - 1 for crystal/constant pressure calculation, 0 otherwise.
(20) - version number (22 for CHARMm 22, 0 for previous version)
ntitl,(title(i),i=1,ntitl)
character*80 title(32)
(IFREAT(I),I=1,NFREAT) integer ifreat(*) points to the free atoms in the whole list of atoms.
(X(I),I = 1,NATOM)
(Y(I),I = 1,NATOM)
(Z(I),I = 1,NATOM)
The coordinates for the first dataset in the file. The time of this dataset is icntrl(2). If NFIXED is 0, this first dataset is the complete set of data giving positions for both the free and fixed atoms.
Note: If this is a file from a Crystal/Constant Pressure calculation, i.e., ICNTRL(11)=1 and QCRYS=TRUE, then there will be an extra record containing symmetric shape index data, XTLABC. XTLABC is a symmetric shape matrix; only the lower triangle is used.
IF(QCRYS) XTLABC
(X1(I),I = 1,NFREAT)
(Y1(I),I = 1,NFREAT)
(Z1(I),I = 1,NFREAT)
Coordinates for the next dataset. If NFIXED is not 0, these coordinates are used as:
do 10 i = 1,NFREAT
10 X(IFREAT(I)) = X1(I)
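The Fortran loop above scatters the free-atom coordinates of a later dataset back into the full coordinate arrays, leaving fixed atoms at their earlier positions. A Python sketch of the same update for all three axes follows; it assumes the arrays are already in memory as lists and that ifreat holds 1-based atom numbers, as in the Fortran. The function name is illustrative.

```python
def apply_free_atoms(x, y, z, ifreat, x1, y1, z1):
    """Update full coordinate lists in place from a free-atom-only dataset.

    ifreat[k] is the 1-based atom number of the k-th free atom."""
    for k, atom in enumerate(ifreat):
        x[atom - 1] = x1[k]
        y[atom - 1] = y1[k]
        z[atom - 1] = z1[k]


# Hypothetical four-atom system in which atoms 2 and 4 are free
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 0.0, 0.0, 0.0]
z = [0.0, 0.0, 0.0, 0.0]
apply_free_atoms(x, y, z, [2, 4], [1.5, 3.5], [0.1, 0.2], [0.0, -0.3])
print(x, y, z)
```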
The CHARMm binary property files are similar to the binary coordinate files, except that there is only one entry for each dataset (in contrast to the three entries, x, y, and z, in a coordinate file).
The Cambridge database file has a four-line header. In the following example, required spaces are represented by explanatory notes enclosed in square brackets [], indicating the number of spaces.
where the Reference Structure is the Cambridge Data Bank code number for the structure. The cell parameters given as A, B, C and ALPHA, BETA, GAMMA, and the space group code are the only header information used and stored by QUANTA. If you are writing out a file in C. D. B. format, you must provide the other information or accept meaningless defaults.
The coordinates (in cell fractional coordinates) then follow, using the form atom number, atom name, fractional coordinates, and atoms bonded to this atom (e.g., I4, 1X, A4, 1X, 3F10.5, 1X, 6I4). All atoms bonded to each atom should be listed so each bond is in effect defined twice.
Due to licensing considerations, for more information regarding this format, please contact:
Crystallographic Data Centre
12 Union Road
Cambridge
CB2 1EZ
UK
+44 1223 336408
Orthogonal coordinates may be in angstroms or nanometers. The program tests the coordinates and suggests which units are being used, but you can override this decision.
The Gromos program expects the atoms of a residue to be given in a specified order. When outputting Gromos files, QUANTA attempts to reorder atoms correctly. The file $HYD_LIB/gromos.ord contains a list of all the amino acid atoms in the required order. You can edit this file if necessary. The required ordering for non-amino acid residues is not included in the file. The routine expects atom names to follow IUPAC-IUB conventions. If they do not, atoms may not be recognized, in which case they are placed at the end of the residue. You are informed when this is the case.
Converting Gromos Trajectory Files. The GROCH program converts Gromos format trajectory files to CHARMm format trajectory files, which can be read by QUANTA. When using this program, you should be prepared to provide the following information on the Gromos format file:
Accelrys provides the GROCH program as source code in the QUANTA Utility directory.
Coordinate files written as a result of quantum mechanics calculations are identified in QUANTA by the extension .qmc.
The nohpro.dic file provides a reasonable assignment of charges in proteins when there are no hydrogens in the structure. The file contains documentation describing the strategy, which is essentially:
1. If the amino acid residue is charged, then the sidechain total charge adds up to +1 or -1.
2. If it is not charged, then:
a. If it is a donor, the total charge is slightly positive.
b. If it is an acceptor, the total charge is slightly negative.
c. If it is both, the total charge is null.
The generic.dic file adds some additional charges to atoms, reduces the default charges on oxygens, and adds a default charge on nitrogen.
Template files contain ideal coordinates for a residue along with other information required to perform mutations.
1. N Atom Records - The format for each atom record is:
ATNAM = atom name
TYPE = atom type
X Y Z = atomic coordinates
These files are adapted from the Brookhaven Protein Data Bank (PDB) format so there are extra fields in this record which are not used by QUANTA.
2. N Pos Records - The format for each pos record is:
This specifies that the position of AT1 is defined in terms of the positions of AT2, AT3, and AT4.
3. N Tor Records - The format for each tor record is:
This specifies that, if AUTO TOR is turned on, the torsion angle AT1-AT2-AT3-AT4 should be set up after the mutation.
The example shown here is the template file for Valine, $HYD_LIB/tmplatnoh/val.pdb.
The template files are in subdirectories of the library directory and references to them are kept in template library files in the library directory. There are three sets of templates:
QUANTA determines the correct files to use automatically. The default template library is tmplatnoh.tlf for proteins with no hydrogen atoms defined.
The following example, the protein_polarhydrogen template library file, $HYD_LIB/tmplatpol.tlf, shows the template library format.
*N CA C
tmplatpol/ala.pdb Alanine ala A
tmplatpol/arg.pdb Arginine arg R
tmplatpol/asn.pdb Asparagine asn N
tmplatpol/asp.pdb Aspartic_acid asp D
tmplatpol/cys.pdb Cysteine cys C
The first record is a * followed by the names of three backbone atoms (that is, atoms whose position is invariant when residues are mutated).
If these three atoms are not found in the residue to be mutated, the program issues the error message: Join atoms not properly defined. If this error message appears, you should not continue with the mutation. Old versions of the template library files do not contain this first line and the program assumes that the molecule is a protein with main chain atoms N, Ca, C.
The file then contains one record for each residue, listing the name of the template file, the name of the residue, a three-letter code for the residue, and a one-letter code for the residue. The residue name must be a single word. If necessary, words can be connected with underscores (for example, aspartic_acid).
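The template library layout described above (an optional * line naming the three join atoms, then one record per residue) can be read with a few lines of Python. The sketch below is illustrative only: the function name and dictionary keys are hypothetical, and the fallback to N, CA, C for files without a * line follows the behavior described for old library files.

```python
def parse_template_library(text):
    """Return (join_atoms, entries) from the text of a template library file."""
    join_atoms = ["N", "CA", "C"]   # default assumed when no * line is present
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("*"):
            join_atoms = line[1:].split()
            continue
        # template file, residue name (single word), 3-letter code, 1-letter code
        path, name, code3, code1 = line.split()
        entries.append({"file": path, "name": name,
                        "three_letter": code3, "one_letter": code1})
    return join_atoms, entries


sample = """*N CA C
tmplatpol/ala.pdb Alanine ala A
tmplatpol/asp.pdb Aspartic_acid asp D
"""
atoms, residues = parse_template_library(sample)
print(atoms, [r["three_letter"] for r in residues])
```

Splitting on whitespace is what makes the single-word rule for residue names necessary: a name containing a space would produce five fields instead of four.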
During interactive mutation, QUANTA looks at the atoms present in the residue to be mutated and attempts to assign the correct template library file.
In the Protein Design application, QUANTA contains a simple algorithm to add hydrogen atoms to a protein. This algorithm uses a template to fit hydrogen atoms, but it does not perform energy minimization, so the structure geometry may not be ideal. However, a molecular mechanics program such as CHARMm can be used to improve the structure geometry. The hydrogen addition routine assumes atoms have the atom type numbers defined in the param.par parameter file. If errors occur when you use this routine, they are probably due to bad atom type assignments.
The hydrogen addition algorithm superimposes a template over the existing atoms in the structure and then takes hydrogen atom coordinates from the template and adds them to the structure. In order to fit the template unambiguously, the template must contain the coordinates of three atoms which already exist in the structure. These three atoms are usually the atom to which hydrogen atoms will be attached and the two first neighbors. For example, in adding two hydrogen atoms to a tetrahedral carbon atom which is already bonded to two non-hydrogen atoms, the required template must contain coordinates for the atoms.
For some structures, the atom to which hydrogen atoms are to be added may only have one first neighbor. In this case, the third guiding atom in the template must be a second neighbor. For example, in adding three hydrogen atoms to a methyl carbon atom, the template file contains coordinates for the atoms.
The coordinates for all the templates are contained in a single file hydtpl.dat which may be found in the Library Directory.
The following format is the template for methyl C:
* add 3 hydrogens to tetrahedral C
4 1 1 3 2
13
10
2 3.980 -1.526 -2.575
0 3.494 -0.781 -1.318
0 1.956 -0.745 -1.257
3 5.069 -1.537 -2.593
3 3.607 -2.550 -2.558
3 3.607 -1.019 -3.465
Lines beginning with a * are treated as comments and are not read by the program. Each template begins with one or more comment lines.
The first line read by the program contains the following parameters:
Template number
Number of first neighbors
Number of second neighbors
Number of hydrogen atoms
Polar/non polar code
The template number must correspond to the position of the template in the template file. The polar/nonpolar code has a value of 1 for polar hydrogen atoms and 2 for nonpolar hydrogen atoms.
The next line lists the atom types for which the template is applicable. This is followed on the next line by the new atom type codes for the atoms once the hydrogens have been added.
The atom coordinates that follow are in the order:
atom to which hydrogen atoms will be attached
first neighbors
second neighbors
hydrogen atoms
The first column of the coordinate data is the atom type code. It is redundant except for hydrogen atoms. This is the atom type code that is given to added hydrogen atoms. Table 46 lists the atom type codes.
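To make the template layout concrete, the Python sketch below reads one template in the format described above and groups the coordinate lines by role. It assumes whitespace-separated fields and that the two atom type lines hold one or more integers; the function name and dictionary keys are hypothetical.

```python
def parse_hydrogen_template(text):
    """Parse a single hydrogen-addition template (comment lines start with *)."""
    lines = [l for l in text.splitlines()
             if l.strip() and not l.lstrip().startswith("*")]
    number, n_first, n_second, n_hyd, polar = (int(v) for v in lines[0].split())
    applies_to = [int(v) for v in lines[1].split()]   # atom types the template fits
    new_types = [int(v) for v in lines[2].split()]    # types after hydrogens are added
    n_atoms = 1 + n_first + n_second + n_hyd
    atoms = []
    for l in lines[3:3 + n_atoms]:
        t, x, y, z = l.split()
        atoms.append((int(t), float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError("template has too few coordinate lines")
    return {
        "number": number,
        "polar_code": polar,            # 1 = polar, 2 = nonpolar
        "applies_to": applies_to,
        "new_types": new_types,
        "center": atoms[0],
        "first_neighbors": atoms[1:1 + n_first],
        "second_neighbors": atoms[1 + n_first:1 + n_first + n_second],
        "hydrogens": atoms[1 + n_first + n_second:],
    }


methyl = """* add 3 hydrogens to tetrahedral C
4 1 1 3 2
13
10
2 3.980 -1.526 -2.575
0 3.494 -0.781 -1.318
0 1.956 -0.745 -1.257
3 5.069 -1.537 -2.593
3 3.607 -2.550 -2.558
3 3.607 -1.019 -3.465
"""
tpl = parse_hydrogen_template(methyl)
print(tpl["number"], len(tpl["hydrogens"]), tpl["hydrogens"][0][0])
```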
Dummy atom files (.dum) have the following format:
The following example is a file for a static dummy which was defined as a midpoint of 3 atoms. The commented lines (starting with #) form the midpoint definition.
DUM1 COOR 0 .103 .332 .418
#DUM1 MIDP 3
#ATOM C2 RESI BENZ:1 MOLE benz.msf
#ATOM C4 RESI BENZ:1 MOLE benz.msf
#ATOM C6 RESI BENZ:1 MOLE benz.msf
The following file is used for a dynamic dummy:
DUM1 MIDP 3
ATOM C2 RESI BENZ:1 MOLE benz.msf
ATOM C4 RESI BENZ:1 MOLE benz.msf
ATOM C6 RESI BENZ:1 MOLE benz.msf
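The two examples above differ only in whether the dummy is stored as fixed coordinates (COOR) or as a live midpoint definition (MIDP followed by ATOM lines). The Python sketch below separates the two cases. It is inferred from the examples alone: the meaning of the integer after COOR is not known and is ignored, and all names in the code are hypothetical.

```python
def parse_dummy_file(text):
    """Return {dummy name: description} for a dummy atom file."""
    dummies = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                     # commented definition lines are skipped
        f = line.split()
        if f[0] == "ATOM":
            if current is None:
                raise ValueError("ATOM line outside a MIDP definition")
            # ATOM <atom name> RESI <residue> MOLE <msf file>
            current["atoms"].append(
                {"atom": f[1], "residue": f[3], "molecule": f[5]})
        elif f[1] == "COOR":
            dummies[f[0]] = {"kind": "static",
                             "coords": tuple(float(v) for v in f[3:6])}
            current = None
        elif f[1] == "MIDP":
            current = {"kind": "dynamic", "natoms": int(f[2]), "atoms": []}
            dummies[f[0]] = current
    return dummies


dynamic = """DUM1 MIDP 3
ATOM C2 RESI BENZ:1 MOLE benz.msf
ATOM C4 RESI BENZ:1 MOLE benz.msf
ATOM C6 RESI BENZ:1 MOLE benz.msf
"""
print(parse_dummy_file(dynamic)["DUM1"]["kind"])
```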
The QUANTA brick map files (.mbk) can be used to store 3D information on a grid. These files can then be used to create contoured wire-frame representations of the information, graphical objects in QUANTA, or a map within the X-Ray structure package.
The information on this 3D grid is stored in bricks. The overall grid is divided into 6 x 6 x 6 pieces, and the data for each brick is stored in a single direct-access record. This approach increases the speed and flexibility of selecting and retrieving portions of a complete map for contouring. In practice, bricks overlap one another on one edge to ensure that a continuous surface is generated in contouring.
The QUANTA brick map file can be used to store single values at each point on a 3D grid, a vector at each grid point, or both a number of vectors and single points. In addition, each vector or single grid point can be a single byte or an integer*4 value. Which size you choose depends on the balance between dynamic range and disk space. Within a file, the type of data used must be the same. QUANTA contains facilities to recognize and work with any of these combinations of data types.
The header of the brick map file contains all the information about the contents of the file. QUANTA reads brick map files generated by earlier versions of the program.
Brick map files are direct-access files with record lengths of 54 words for byte-type files and 216 words for integer*4-type files. The format is:
line 1: version, ntitle, filetype
lines 2 to ntitle+1: (title(i),i=1,ntitle)
There is a limit of 50 title lines. The title contains not only textual information about the file, but also some HEADER records that detail the type of information (single point or vector) held in the file. If no HEADER records are included, QUANTA assumes this is a file containing just a single grid of scalar values.
The following additional HEADER records are only necessary if a vector field display is to be generated. The order of the HEADER records reflects the order of the data in the file. A "V" at position 27 in the HEADER record indicates that the file contains three sets of grid data corresponding to the x, y and z of a vector for the grid points. An "S" at position 27 in the HEADER record indicates a scalar set of grid data.
For example, a brick map file containing the magnitude and direction of an electrostatic field as byte values around a molecule would have the following header information:
mbk_1.0, 3,1
Gives the version number and the number of title lines (3) and indicates a byte map.
HEADERFIELD S1
Indicates that the first grid of data will be a single byte scalar value of the electrostatic field. The scale and offset from the main header apply to this grid of data
HEADER VECTOR ORIENTATION V1 1.000000E+01 0.000000E+00 -
Indicates that the next three grids of data in the file will form a vector of bytes with a scale and offset taken from the values given.
Electrostatic Field map generated on ........
The final record before the grid data is the main header block of information specifying the various parameters that define the position, scale, grid, and so on of the map as indicated below.
card ntitle+2: nsec, mxyz,nbxyz,nw1,nu1,nu2,nv1,nv2,
cell,rhrms,offset,scale,lenbrk,ncode,rhmin,rhmax
(int4 or real4 as per first character of name).
The cell constants can be interpreted in one of two ways, depending on the value of the orthogonalization code, ncode. If ncode is between 1 and 6, then the grid is in a standard crystallographic fractional coordinate system with the cell constants, specified as a,b,c, alpha, beta, gamma, defining the transformation of orthogonal angstroms. This requires that one of the grid points falls on the origin. If the ncode is 0, then the cell constants define the origin and extent of the grid of points, specified as origin(x), origin(y), origin(z), extent(x), extent(y), extent(z), in orthogonal angstroms. This allows a grid that does not fall on the origin to be stored.
If a grid point in an integer*4 brick map has a value of 32766 then it will be ignored in the contouring within QUANTA. This is useful for masking certain parts of a grid, without getting a contour at this boundary.
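The record lengths quoted above (54 words for byte maps, 216 words for integer*4 maps) are consistent with each record holding 6 x 6 x 6 = 216 grid values packed into 4-byte words. That reading is inferred from the numbers rather than stated in the format description, so the Python sketch below should be taken as a consistency check, not a specification. It also shows the 32766 mask value being filtered out before contouring; the names are illustrative.

```python
BRICK_EDGE = 6        # grid values along each brick edge (inferred)
WORD_BYTES = 4        # assumed word length in bytes
MASK_VALUE = 32766    # value ignored during contouring


def brick_record_words(bytes_per_value):
    """Record length in words for one brick of the given value size."""
    values_per_brick = BRICK_EDGE ** 3
    return values_per_brick * bytes_per_value // WORD_BYTES


def unmasked(values):
    """Drop masked grid values so they do not take part in contouring."""
    return [v for v in values if v != MASK_VALUE]


print(brick_record_words(1), brick_record_words(4))
print(unmasked([10, 32766, -4, 32766, 7]))
```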
The plot file (.qpt) in QUANTA is a binary file with each record written as a, x, y, z (i.e., four real numbers). The command number is represented by a; x, y, and z are parameters for the command:
a = 1 change to color (or pen) x, line width y
a = 2 move to x y z
a = 3 line to x y z
a = 4 dot at x y z
a = 5 draw x characters followed by a record of string (1: x)
a = 6 character scale x y z (interpretation of this in the program
a = 7 define patterned line where:
z is length of pattern
y is space length
x is dash lengths
a = 8 new frame
a = 9 rotate everything by x degrees
a = 10 set the units for the plot
a = 11 plot a symbol as char(x) at the current point
a = 12 delete this
a = 13 set the plotter limits to x,y in physical device coordinates
a = 14 set the plotting window to x,y
a = 15 flag to specify stereo plotting (x is the stereo angle to use) where the stereo is created by the plotting program
a = 16 two records to specify an rgb value for a color number; the first record specifies the color number (x), the second record specifies r g b as x, y, z
a = 17 two records to specify a filled rectangle; the first record contains the bottom left corner, the second record contains the top right corner
Not all the commands are currently used by QUANTA and the interpretation of many of them depends on your plotter and the program you use to drive it.
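As a small illustration of how the command stream can be interpreted, the Python sketch below collects polylines from a list of (a, x, y, z) records using only the move-to (a = 2) and line-to (a = 3) commands and ignoring the rest. It assumes the records have already been decoded from the binary file; starting a new polyline when a line-to arrives with no preceding move-to is a choice made for this sketch, not documented behavior.

```python
MOVE_TO = 2
LINE_TO = 3


def trace_polylines(records):
    """Group move-to/line-to records into lists of (x, y, z) points."""
    polylines = []
    current = None
    for a, x, y, z in records:
        command = int(a)                 # the command number is stored as a real
        if command == MOVE_TO:
            current = [(x, y, z)]
            polylines.append(current)
        elif command == LINE_TO:
            if current is None:          # no pen position yet (sketch assumption)
                current = []
                polylines.append(current)
            current.append((x, y, z))
        # every other command is ignored in this sketch
    return [p for p in polylines if len(p) > 1]


records = [
    (1.0, 2.0, 1.0, 0.0),    # change color
    (2.0, 0.0, 0.0, 0.0),    # move to
    (3.0, 1.0, 0.0, 0.0),    # line to
    (3.0, 1.0, 1.0, 0.0),    # line to
    (2.0, 5.0, 5.0, 0.0),    # move to
    (3.0, 6.0, 5.0, 0.0),    # line to
]
print(trace_polylines(records))
```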
The following data files are used by ChemNote, the Sequence Builder, and the Molecular Editor applications.
This is a template file for the ChemNote to CHARMm conversion mode.
2. $QNT_CHEM/chrmtype.typ and $QNT_CHEM/chrmpost.typ
These files define the CHARMm atom types.
This file contains values for backbone structures of polypeptides. This file is read by the Sequence Builder. Changes are made by editing the file directly.
Conformation of a molecule is specified by indicating values for dihedral angles. Backbone structure often extends over several residues. Commonly used structures are defined in this file.
The format of the file, and instructions for modifying the information contained in the file are given in the file.
This file contains the shorthand names for dihedral angles and is read by the Sequence Builder. Changes are made by editing the file directly.
Conformation of a molecule is specified by indicating values for dihedral angles. Dihedral angles are identified by the four atoms involved in the angle. However, there are many shorthand names for the important angles that make dihedral identification easier. Phi, psi, and omega are some of these shorthand names in a polypeptide backbone. Because there is more than one convention for naming dihedral angles and the shorthand names may change from residue to residue, the Sequence Builder must have a flexible way of giving dihedral angles shorthand names.
The format of the file, and instructions for modifying the information contained in the file are given in the file.
This template is used by the Sequence Builder to create a command input file for residue sequences.
This file contains the menu and dialog box information for the Sequence Builder.
The files chrmtype.typ and chrmpost.typ contain the definitions used to assign atom types to atoms in ChemNote and the QUANTA Molecular Editor. The files contain several rules, each associating a pattern with an atom type. If the atoms and bonds around a specific atom match the pattern in a rule, then the atom is assigned the rule's atom type.
The format of both files is the same. However, the files vary in usage. The file chrmtype.typ is applied first to obtain most of the basic typing. Highly complicated systems such as some heterocycles and conjugated systems must have their typing refined. This is the function of the file chrmpost.typ. Both files are applied to all molecules, but molecules with simple typing are not affected by the rules contained in chrmpost.typ.
In the atom type files, lines starting with `*' must appear exactly as illustrated. Lines starting with `!' are comments which are ignored by the program. The file is divided into the following sections:
The program checks the format version number against the expected number. If the numbers do not match, it issues an error message and does not read the file. The file update version number, which represents the last date that the file was changed by Accelrys, should not be altered.
2. The number of rules in the file
The format of this line is "P [#]", where [#] represents the number of rules defined in the file. In the example, the full file must define 298 rules. If you add or remove rules, adjust this number accordingly. There is no predefined array limit on the number of type rules that can be added, but each rule takes up some memory. If too many rules are added, typing may become slower, or in extreme cases even generate "out of memory" errors.
There must be exactly as many rules as indicated by the number on the `P' line mentioned above. It is a good idea to try to illustrate the pattern being defined in the rule using comment lines before each rule, as shown in the example. Most rules have such illustrations, and you are encouraged to keep the pictures up-to-date if you change or add rules.
A rule begins with a line containing "T [#]", where [#] states the number of subsequent lines which make up the rule. Each line of the rule after the "T" defines an atom in the pattern, so [#] also represents the number of atoms that the pattern matches. In the first rule in the example, the line is "T 4", and the rule contains four subsequent lines.
The first atom in the rule is special, since it is the atom whose type will be assigned when the pattern is found to match a part of the structure. The next few lines describe the atoms directly connected to the first atom, subsequent lines define atoms connected to these atoms, and so on. There is no real limit on how far a rule can extend, but on a practical basis rules rarely travel more than three atoms out.
The format of an atom line is: a b c element
The first field, a, specifies the line number, relative to the current atom line, where the definitions for attached atoms begin. For example, in the first atom line in the first rule in the example, a is 1. This means that the next line in the rule begins the definition of atoms attached to the central atom of the rule, in this case a hydrogen. The value of a is always 1 in the first line of a rule.
The second field, b, specifies the number of atoms connected to the current atom. All connected atoms must be defined in contiguous lines in the rule, starting with the line specified by the first field as described above. If b is positive, then exactly that many atoms must be attached to the atom for it to match the pattern. If b is negative, then there must be at least |b| atoms attached (|b| is the absolute value of b); if there are more the atom will still fit the pattern. If b is zero then the rule does not continue beyond this atom; it doesn't matter how many atoms are connected to it.
The third field, c, has a different meaning for the first atom than it does for subsequent atoms. For the first atom, it specifies the numeric atom type that will be assigned to the atom if the rule matches the atom and its surroundings. This number is associated with the atom type names in the file MASSES.RTF. For all lines after the first, this number represents the bond order for the bond between this atom and the subsequently-defined atom it is connected to. In the second line of the rule in the example, this number is 1, which means there must be a single bond between the N and H. A `?' may be used as a wildcard, in which case it's only important that a bond exists, not what sort of bond. Allowable bond orders are 1, 2, 3, 7, and 12, where 7 and 12 may be used interchangeably to indicate a resonant or aromatic bond.
The fourth field, element, specifies the element each atom must be to match the rule. In all lines after the first, a `?' may be used as a wildcard to match any element. So in the example rule, the first line specifies that the rule matches a hydrogen; the second line matches a nitrogen.
Looking at the first rule, we see that the first line specifies a hydrogen atom; it must be attached to only one atom, whose definition is on the next line. If the rule matches, the hydrogen will be given the atom type 2 (HC). The next line specifies that the attached atom must be a nitrogen, connected to the hydrogen by a single bond; furthermore it must be attached to exactly two other atoms, whose definitions follow. The following lines specify that it doesn't matter what element those two atoms are, just that one of the bonds must be resonant, and one single.
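The walk-through above can be mirrored in a short Python sketch that reads one rule and resolves, for each atom line, which later lines describe its attached atoms (using field a as a relative offset and |b| as the count). This is an illustration of the file layout only, not QUANTA's matching code; the function name and dictionary keys are hypothetical, and field c is kept as text because it may be the wildcard ?.

```python
def parse_rule(lines):
    """Parse a 'T n' line followed by n atom lines of the form: a b c element."""
    header = lines[0].split()
    if header[0] != "T":
        raise ValueError("a rule must start with a T line")
    n = int(header[1])
    atoms = []
    for i, line in enumerate(lines[1:1 + n]):
        a, b, c, element = line.split()
        a, b = int(a), int(b)
        # Attached atoms occupy |b| contiguous lines starting a lines further on.
        children = list(range(i + a, i + a + abs(b))) if b != 0 else []
        atoms.append({"exact_count": b > 0, "count": abs(b),
                      "c": c, "element": element, "children": children})
    if len(atoms) != n:
        raise ValueError("rule is shorter than its T line states")
    return {"assigned_type": int(atoms[0]["c"]), "atoms": atoms}


rule = parse_rule(["T 4", "1 1 2 H", "1 2 1 N", "0 0 1 ?", "0 0 12 ?"])
print(rule["assigned_type"], [atom["children"] for atom in rule["atoms"]])
```

For the first rule in the example, the hydrogen (line 0) points at the nitrogen (line 1), which in turn points at the two wildcard atoms (lines 2 and 3).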
The above rules do not deal with cyclic systems; some atom types, however, are specific to ring systems. The last section of the file contains rules which assign ring-specific atom types.
The first line of this section has the format "R [#]", where [#] is the number of ring rules that follow. Since each ring rule is a single line, [#] also specifies the number of lines that follow before the end of the file (excluding blank or comment lines).
Each line has the following format:
type size new_type ring1 ring2 ring3
These rules are applied after the initial typing rules are finished, so they can depend on the atom types the initial rules have assigned.
The first field, type, specifies the numeric atom type an atom must have for QUANTA to attempt to apply the rule to the atom.
The second field, size, specifies that the atom must be a member of a ring of the specified size. If size is negative, it means that the ring must be aromatic as well as having |size| number of atoms. If size is -1, however, it means the rule applies to atoms in conjugated ring systems, and then the fourth, fifth, and optionally sixth fields are used to specify the sizes of the two or three rings the atom must be a member of. If size is zero, it means the atom must be a member of at least one ring, but the number of rings and the ring size is unimportant.
The third field, new_type, specifies the atom type to assign to any atom that matches the pattern described by the current rule.
33 0 32
means that any atom of type NX (33) that appears in a ring should be changed to an NP (32). The more specific rule:
22 -5 21
means that any C6R atom (22) that is in a 5-member aromatic ring should be changed to C5R (21). Finally, the conjugated ring rule:
27 -1 26 6 5 0
means that any atom of type CR66 that is a member of both a 5- and a 6-membered ring should become CR56.
It is possible to have up to three conjugated rings, so ring sizes should be delimited. If fewer than three rings are used, fill the remaining fields with 0.
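The ring rules above can be expressed as a small matching function. The Python sketch below parses one rule line and applies it to an atom, given the sizes of the rings the atom belongs to and which of those sizes are aromatic. Ring perception and aromaticity detection are assumed to happen elsewhere, and the function names are hypothetical.

```python
def parse_ring_rule(line):
    """Parse 'type size new_type [ring1 ring2 ring3]' into a tuple."""
    f = [int(v) for v in line.split()]
    atom_type, size, new_type = f[:3]
    rings = [r for r in f[3:] if r != 0]     # 0 marks an unused ring field
    return atom_type, size, new_type, rings


def apply_ring_rule(rule, atom_type, ring_sizes, aromatic_sizes=()):
    """Return the new type if the rule matches the atom, else the old type."""
    rule_type, size, new_type, rings = rule
    if atom_type != rule_type:
        return atom_type
    if size == 0:                            # member of at least one ring
        matched = len(ring_sizes) > 0
    elif size == -1:                         # conjugated ring system
        matched = bool(rings) and all(r in ring_sizes for r in rings)
    elif size < 0:                           # aromatic ring of |size| atoms
        matched = -size in aromatic_sizes
    else:                                    # ring of exactly this size
        matched = size in ring_sizes
    return new_type if matched else atom_type


fused = parse_ring_rule("27 -1 26 6 5 0")
print(apply_ring_rule(fused, 27, ring_sizes=[5, 6]))
print(apply_ring_rule(fused, 27, ring_sizes=[6]))
```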
The file is terminated by a line containing "* End of File".
The following example represents a portion of the atom typing rule file.
* Polygen Corporation: ChemNote atomtype rules file
* File format version number
86.1124
* File update version number
91.0621
*
! Total number of patterns in the data file.
P 298
! H2-N- H on a charged group - HC
! |r
!
T 4
1 1 2 H
1 2 1 N
0 0 1 ?
0 0 12 ?
!
! HC-NC- H on a uncharged guanidinium group - HC
!
.
.
.
.
T 1
1 0 176 Re
T 1
1 0 6 MBe
T 1
1 0 7 B
!
! Ring cycles
!
.
.
.
.
R 21
33 0 32
22 -5 21
27 5 25
28 6 29
28 0 14
30 6 39
23 6 24
34 6 35
72 6 73
52 6 53
182 6 181
10 3 191
10 4 193
14 3 190
14 4 192
32 4 33
27 -1 26 6 5 0
182 -1 180 5 6 0
25 -1 26 5 6 0
181 -1 180 6 5 0
195 -1 26 6 5 0
* End of File