C. Customizing the Databases

Overview

Using the QUANTA databases requires two data files, which by default reside in the database directory, $QNT_ROOT/library. These files are database.dat, which contains structure information, and pdbsequence.lib, which contains sequence information. Versions of these files containing the data extracted from all of the current structures in the Brookhaven Protein Databank are included in the PDB/Database release tape. To display fragments corresponding to database search hits, QUANTA reads the coordinates from the appropriate MSF file.

Users may wish to customize the databases to include additional structures. Facilities are included to recreate or append to the database files. The database creation program is $QNT_ROOT/db/crebase. This program requires a control file, the PDB master file. The standard distribution version of this file is $QNT_ROOT/db/crebase/crebase.inp. This file contains a list of PDB files that go into a database. For each PDB file, the PDB master file specifies what structural analysis should be performed and which database files should be included.

The PDB Master File

The PDB master file, pdb_master.list, lists every file in the Brookhaven Protein Databank that is available from Accelrys. It also holds some basic information for each file, such as the molecule type and the crystallographic refinement method, together with commands that control how much analysis the CREBASE program performs and which information is stored.

The structure and sequence database files are created by extracting the required data from PDB files. The residue IDs and atom coordinates are taken from the PDB file, and the residue geometry, such as torsion angles and side-chain centers, is calculated. The protein resolution and the information in the PDB cards HEADER, COMPND, and SOURCE are incorporated into the database. Some additional textual information is taken from the CREBASE input file; much of this data is present in the PDB file header but not in a form easily extracted automatically. If you wish to incorporate extra text information into the database, this can be included in the CREBASE input file.
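As an illustration of the kind of residue geometry described above, the following minimal Python sketch computes a torsion angle from four atom positions and a simple coordinate centroid. It is not the CREBASE implementation: the function names are hypothetical, and treating the side-chain center as the unweighted mean of the side-chain atom coordinates is an assumption made for the example.

```python
import math


def _sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_angle(p1, p2, p3, p4):
    """Torsion angle (degrees) about the p2-p3 bond for atoms p1..p4."""
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    b3 = _sub(p4, p3)
    n1 = _cross(b1, b2)          # normal of the plane p1, p2, p3
    n2 = _cross(b2, b3)          # normal of the plane p2, p3, p4
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def centroid(points):
    """Unweighted mean position of a list of 3D points (assumed 'center')."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))
```

For a protein backbone, phi would be obtained by passing the C(i-1), N(i), CA(i), and C(i) coordinates in that order, and psi by passing N(i), CA(i), C(i), and N(i+1).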

CREBASE reads the PDB master file, pdb_master.list, in free format. Each line should begin with a four-letter keyword command (the MSF keyword, which is read by MSFGEN, has three letters):

FILE filename
Name of the PDB file. This must have a path name appropriate for the directory from which the CREBASE job is run.

TYPE type
The type of structure, which may be:

protein
Ca only (protein with only Ca coordinates)
nucleic acid
carbohydrate

The default is protein. Structure analysis for database structure search is only possible for proteins.

FAML family
The name(s) of the protein families to which the protein belongs.

XTAL space group
Crystal space group.

REFI refinement_method
Refinement method.

NMOL nmol
Number of independent molecules in the PDB file.

LIGD residue_id residue_name full_name
The residue_id and residue_name as found in the PDB file, followed by a fuller name.

ANAL
Structural analysis of this protein is required.

WRIT
Write all the information on this protein to the database file.

MSF
Keyword for the MSFGEN program, which uses the same command file as CREBASE. It is ignored by CREBASE and indicates that this protein should be included in the MSF library.
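To make the keyword layout concrete, the sketch below parses master-file text into one record per PDB file. It is only an illustration under stated assumptions, not the parser used by CREBASE: it assumes one keyword per line, that a FILE line starts a new entry, and that the keyword lines following it apply to that entry. The file names in the sample are hypothetical.

```python
KEYWORDS = {"FILE", "TYPE", "FAML", "XTAL", "REFI", "NMOL",
            "LIGD", "ANAL", "WRIT", "MSF"}
FLAGS = {"ANAL", "WRIT", "MSF"}   # keywords that take no argument


def parse_master(text):
    """Parse master-file text into a list of per-file dictionaries."""
    entries = []
    current = None
    for lineno, raw in enumerate(text.splitlines(), 1):
        parts = raw.split(None, 1)
        if not parts:
            continue                       # skip blank lines
        keyword = parts[0].upper()
        rest = parts[1].strip() if len(parts) > 1 else ""
        if keyword not in KEYWORDS:
            raise ValueError(f"line {lineno}: unknown keyword {parts[0]!r}")
        if keyword == "FILE":
            # Assumption: FILE opens a new entry; TYPE defaults to protein.
            current = {"FILE": rest, "TYPE": "protein",
                       "LIGD": [], "flags": set()}
            entries.append(current)
        elif current is None:
            raise ValueError(f"line {lineno}: {keyword} before any FILE")
        elif keyword in FLAGS:
            current["flags"].add(keyword)
        elif keyword == "LIGD":
            # residue_id, residue_name, then the rest as the full name
            current["LIGD"].append(tuple(rest.split(None, 2)))
        else:
            current[keyword] = rest
    return entries


# Hypothetical sample input, for illustration only.
SAMPLE = """\
FILE pdb/example1.pdb
TYPE protein
NMOL 2
ANAL
WRIT
MSF
FILE pdb/example2.pdb
TYPE nucleic acid
WRIT
"""

if __name__ == "__main__":
    for entry in parse_master(SAMPLE):
        print(entry["FILE"], entry["TYPE"], sorted(entry["flags"]))
```

Keeping the argument-free keywords (ANAL, WRIT, MSF) in a separate set makes it easy to check afterwards which files were marked for analysis, for writing, or for the MSF library.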

Running CREBASE to create the Database Files

It is easiest to run the CREBASE program in the directory containing the PDB files. Start the CREBASE program by entering the command:

> $HYD_EXE/crebase

The user is prompted for the name of the input file, the structure database file, the sequence file, and a log file.

If the name of an existing database file is given, there is the option to append to the file or overwrite it. While it is possible to append to an existing database file, it is not possible to edit existing files. If no log file name is provided, the output is sent to the screen. The use of log files is recommended, so that any problems in creating the database can be identified afterwards. This program may take considerable time to run, depending on the platform configuration.

For each protein, the log file lists the filename, resolution, and keyword text. Any residue that contains an unexpected number of atoms is reported; this is quite common, because side chains are often undefined or have alternate positions. If the structure is found to contain a high proportion of non-amino-acid residues, or the residues contain only Ca atoms, the structure is not included in the database. Once the new versions of the database files are created, they should be moved to $HYD_LIB/database.dat and $HYD_LIB/pdbsequence.lib.
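The final move into the library directory can be scripted. The following Python sketch is one possible way to do it, not a tool supplied with QUANTA: the function name is hypothetical, and keeping a .bak copy of any existing file is an extra precaution assumed here, not a step required by the procedure above.

```python
import shutil
from pathlib import Path


def install_database(new_db, new_seq, lib_dir):
    """Move new database files into lib_dir under their standard names."""
    lib = Path(lib_dir)
    targets = [(Path(new_db), lib / "database.dat"),
               (Path(new_seq), lib / "pdbsequence.lib")]
    # Check both sources first so a missing file leaves the library untouched.
    for src, _ in targets:
        if not src.is_file():
            raise FileNotFoundError(src)
    for src, dst in targets:
        if dst.exists():
            # Assumption: keep the previous version as <name>.bak.
            shutil.copy2(dst, dst.with_name(dst.name + ".bak"))
        shutil.move(str(src), str(dst))
    return [dst for _, dst in targets]
```

A typical call would pass the two files written by CREBASE and the value of the HYD_LIB environment variable, for example install_database("database.dat", "pdbsequence.lib", os.environ["HYD_LIB"]) after importing os.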

Creating the MSF Library

In running a standard database search, the SEARCH program writes out a selection file which lists the hit fragments in standard QUANTA selection format. For QUANTA to use the selection file it is necessary to have MSF files for the proteins.

The easiest way to generate MSFs is to run QUANTA with a stream file that controls reading PDB files and creating new MSFs. An appropriate stream file can be generated from the PDB master file by running the jiffy program $HYD_EXE/msfgen. The keyword MSF following a PDB file name in the PDB master file indicates that the PDB file should be included in the stream file.

The MSF files are created in the directory from which the QUANTA job is run. The stream file must have appropriate pathnames for the PDB files to be accessed from that directory.
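Before running QUANTA with the stream file, it can be useful to confirm that every PDB file marked with MSF is reachable from the intended run directory. The sketch below does only that check. It does not reproduce MSFGEN or the stream file format, and it assumes that each FILE line starts an entry and that an MSF line applies to the most recent FILE; the function names are hypothetical.

```python
from pathlib import Path


def msf_pdb_files(master_text):
    """Return the FILE names whose entry also carries the MSF keyword."""
    selected = []
    current = None
    for raw in master_text.splitlines():
        parts = raw.split(None, 1)
        if not parts:
            continue                      # skip blank lines
        keyword = parts[0].upper()
        if keyword == "FILE" and len(parts) == 2:
            current = parts[1].strip()
        elif keyword == "MSF" and current is not None:
            if current not in selected:
                selected.append(current)
    return selected


def missing_from(run_dir, filenames):
    """List the names that do not resolve to a file from run_dir."""
    base = Path(run_dir)
    # An absolute name is used as given; a relative one is joined to run_dir.
    return [name for name in filenames if not (base / name).is_file()]
```

An empty list from missing_from means that all path names in the master file are valid for that directory; any names returned need their paths corrected before the MSFs are generated.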

Summary of PDB master file keywords:

FILE filename

TYPE type

FAML family

XTAL space group

REFI refinement_method

NMOL nmol

LIGD residue_id residue_name full_name

ANAL

WRIT

MSF