Galgor: Commands which deal with Genetic Algorithm and Monte Carlo. # Michal Vieth,H. Daigler, C.L. Brooks III -Dec-15-1997 Initial release. The commands described in this node are associated with genetic algorithm module for conformational searches and docking of small ligands to rigid proteins. The full description of the GA features is presented in the paper "Rational approach to docking. Optimizing the search algorithm" * Menu: * Implementation:: A brief description of the anatomy of GA * Syntax:: Syntax of the replication commands * Description:: Description of key words and commands usage * Restrictions:: Restrictions on usage * Examples:: Supplementary examples of the use of GA
Genetic Algorithm and Monte Carlo: Description and Discussion Name Keyword Module GA setup GALGOR SETUP genetic.src Genetic algorithm GALGOR EVOLVE genetic.src, genetic2.src Monte Carlo GALGOR EVOLVE MCARLO genetic.src, genetic2.src This code was created by Michal Vieth, Heidi Daigler and Charles Brooks III at The Scripps Research Institute during the summer/fall of 1997 based on the code provided by Charles Brooks and Heidi Daigler, Department of Chemistry, Carnegie Mellon University developed during the summer of 1994. Its purpose is to enable monte carlo and genetic algorithm based conformational searches to be performed on peptides/proteins, small organic molecules and docking of (small) ligands to their receptors. It builds upon the replica ideas of Leo Caves to make multiple copies of the system, i.e., the chromosomes. These chromosomes make up a population of molecular conformations which are crossed and mutated according to the specific genetic algorithm used. It is recomended to use a generational update version of genetic algorithm with elitism 1 or 2, and and island/nicheing ideas to evolve subpopulations independently. Each chromosome is represented by the set of internal coordinates (and rigid body degrees of freedom if desired) deemed to "evolve", i.e., the genes. The genes are represented as real numbers. The chromosome are crossed at a randomly chosen gene or random genes are mutated with a given rate of mutation. In addition migration of individuals between subpopulations is used if the island model is employed. From each population, the members suitable for crossing, the "parents", are chosen based upon their ranking, as measured by a priori chosen preference to select higher ranked individuals. Evolutionary presure of 1.2 based of scaled fitness following Jones, JMB 245, 43-53 is used for parents selection. The potential energy function used can be all of CHARMM's energy terms, including constraints, or any subset of them. The user energy term can be utilized to include any special functional form which is particularly amenable to the GA search. The current implementation uses the standard CHARMM energy functions. For docking the essential part of the energy function is the use of soft core VDW potential that permits ligands to diffuse into the protein. The protein in docking is rigid, and at this time it is not possible to include partial flexibility. Monte Carlo scheme uses the same engine, the only difference is that the members of the entire population become completly separate entities which do not exchange any information. Their evolution by simulated annealing is similar to MCSS molecular dynamics. Entire approach to docking and comparison of performance of GA/MD/MC is described in detail in two back to back papers submitted to J.Comp.Chem. "Rational approach to docking I and II." In the initial set-up stage, the GALGorithm keyword is parsed and the main subroutine for the genetic algorithm is called from CHARMM (GENETIC). The keyword SETUp causes a number of things to occur. First the number of chromosomes (replicas) and the atoms whose internal coordinates are to be sampled and replicated are parsed. Following this, the specific IC variables which are to be "evolved" are selected by the VARIable_ICs sub-parser. This is carried out in a separate routine which is modeled after the routine used for processing the IC edit command. Finally, the replica subroutine from Leo Caves is called to replicate the system and the original subsystem atoms are deleted by a call equivalent to delete atoms select segid orig end. Each new segment generated by the replica command is given the prefix C (for chromosome) followed by the number of that replica (chromosome). The apropriate pointer structures for inclusion/exclusion of dihedral, angle and bond ICs are then created and control is passed back to the CHARMM level, where specific manipulations can be performed on the individual chromosomes, e.g., to vary initial conformations around the initial progenerator of all chromosomes. Evolution of the population of chromosomes is controlled by calling GALGorithm with the EVOLve keyword from CHARMM. This uses a parser which sets-up the control variables for the genetic algorithm evolution and then runs the genetic evolution. The specific functions performed by this portion of the code are: 1. Parse parameters controlling the GA evolution. GA follows the scheme: 2. Evolve the population of chromosomes. This involves the following steps: a) Evaluate the energy of the population at step 0. b) Rank the population members in accordance with the viability. c) Choose parents to develop the next generation based on roulette wheel selection according to scaled fitness. d) Mate parents by crossover or random mutations in genes, e) Reconstruct (via IC build-like procedure) the cartesian positions of the new generation from modified ICs and if applicable from the modified values of the rigid degrees of freedom. f) Evaluate energies of the new generation g) if elitist model is being used transfer fittest parent into the new generation. If steady state/generational update is used replace least fit parents by children. if elitism is used retain some most fit parents. IF evolutionary strategy is pick the best members to form a new population h) Begin again at step b), cycle through to convergence or MAXGenerations or NEVALuations Monte Carlo proceeds via the following scheme: 2. Evolve the population of chromosomes. This involves the following steps: a) Evaluate the energies of each replica b) Generate new replica by mutations c) Reconstruct (via IC build-like procedure) the cartesian positions of the new generation from modified ICs and if applicable from the modified values of the rigid degrees of freedom. d) Evaluate energies of the new replicas e) Accept new replicas with Boltzmann probability h) Begin again at step b), cycle through MAXGenerations or NEVALuations
Syntax for the Genetic ALGORithm command GALGOR setup: {[SETUP] [CHROmosomes int] [atom selection] - {SEED 3x(<resnum> <atom>)} {[VARIable_IC] - {[DIHEdral] [IMPRO] [INCLude] [4x<atom selection]} {[DEPE] - [4x<atom selection] [OFFSet int]} - {{[BOND] [3x atom selection]} [ALL] [NONE]} - {{[ANGLE] [2x atom selection]} [ALL] [NONE]} - {TRAN} {ROTA} - [END]} evolution: {[EVOLVE] - GA parameters: [PARENTS int] [CHILDREN int] - {[STEADY] [ELITE int]} [EPRES real] - [CROSsover_rate real] [MUTAtion_rate real] [GSIZE int] - [NICHes int] [INTEraction_frequency int] - MC parameters : {[MCARlo] [TBEG real] [TEND real] [TFRQ int] [TJWAlking real] [IJWAlking real]} - Evolution parameters: [MAXGenerations int] [NEVAluations int] [IPRINT int] - { [TOLE real] [TOLC real]} [PRIN_frequency int] - Initialization parameters: {[RANDomization] [ISEED int] [RDIHE real] [RROTA real] - {[RTRA real] [RXTR real] [RYTR real] [RZTR real] [OFTR real]}} - Step sizes: [TRANslation_step real] - [ROTAtion_step real] [BOSTep real] [ANSTep real] - [IMPRoper_step real] [PINTernal real] [PALL real] [PCALL real] [PTALL real] - {[IBIG_step_frq int] [BTRAN real] [BROTA real] [BDIST real] - Nonbonded parameters: {[RMIN real]} [QNOE] [NBFRQ real] - Exit parameters: [DELETE] [CLEAR] [LEAVE int]}}
There are two basic parts of running genetic algorithm or Monte Carlo in CHARMM: Description of the basic key words of the genetic algorithm : The following is the description of the setup commands for setting up the system Keyword/Syntax Default Purpose SETUP setting up the data structure CHRO 50 Number of chromosomes in a population atom sele The atoms to be used as chromosomes SEED 3x(<resnum> <atom>) Specify seed atoms for ic builds Default to use first three atoms in PSF VARI Definition of the active variables - genes. This keyword acts in similar way to IC EDIT. It has to be followed by definition of internal coordinates and end with END statement. DIHE INCL <sele> ALL Within VARI specifies the selection of dihedral angles to be used as active variables. May be followed by DIHE DEPE. DIHE DEPE <sele> OFFS value Within VARI specifies the dihedral that can be computed from the value of the dihedral defined immediately before by adding the OFFSet value, This specifies dihedral dependency if two or more quartets of atoms describe rotation about the same bond DIHE IMPR <sele> ALL Improper dihedrals to be used as active variables. BOND <sele> ALL Bonds to be used as active variable ANGLE <sele> ALL Angles to be used as active variables Example: GALGOR SETUP - CHRO 50 SELE segid maa END - SEED 1 c 2 n 2 ca - VARIable_ICs DIHE INCL 1 C 2 N 2 CA 2 C end DIHE DEPE 2 CA 2 C 3 N 3 CA end OFFSET -2.0 DIHE IMPR 2 N 1 CL 1 C 1 O end BOND ALL ANGLE ALL TRAN ROTA END The following is the description of the setup commands for evolution of the system Keyword/Syntax Default Purpose EVOLVE The beginning of the evolution setup STEA off The type of evolutionary algorithm to be used. IF STEAdy key word is absent evolutionary strategy is used, in which the new population consists of CHRO most fit individuals that are chosen from 2*CHRO parents and children. If STEA key word is present generational (or steady state) update is used in which children replace least fit parents in the population. STEA is highly recomended for problems with two or more variables to avoid premature convergence of the population. PARENTS CHRO The number of parents to be used in mating, typically all chromosomes are permitted to be parents ELITE 2 So called elitism of the population - The number of parents to be retained in the next generation. If STEA key word is used with elitism equal to CHRO-2 a typical steady state update is carried, otherwise the update is called generational. With no STEA key word in evolutionary strategy elitism is not changing the evolution process. CHILDREN CHRO-ELIT The number of children created in each mating EPRES 1.2 Evolutionary presure, this is defined as the relative probability to select the most fit individual with respect to average individual. The scaled fitness is linear and based on Jones, JMB 245, 43-5 NICHes 1 The number of subpopulations (niches) to be evolved independently, This is used as yet another mean to avoid premature convergence. INTE 100 Migration frequency between niches. This is indicates how frequently individuals migrate from one subpopulation to another. THe essence of migration is top replace NICHE-1 worst members in each subpopulation by the best members of other subpopulations CROSsover 0.6 The probability of mating by crossover [0,1], higher values tend to cause faster convergence MUTA 0.4 The probability of mating by mutation, higher values prevent fast convergence. In general the sum of MUTA and CROS would be 1.0, unless we want to create children as clones of some parents. The clones will make sense in generational update, less sense in evolutionary strategy and no sense in the steady state update. GSIZE 1 The number of variables coding for a gene. Since we use floating point numbers to represent a gene 1 is strongly recomended. Higher values would affect the meaning of crossover changing it to mutation if it occurred within a gene. MCARLO off Turns on Monte Carlo procedure, instead of GA, CHROM act now as separate replicas not interacting with each other and each evolving by means of mutations independently TBEG 300.0 Initial temperature of the system in K TEND 300.0 Final temperature in K TFRQ 10 Temperature change frequency, the temperature change is calculated in the following way : tchange=(TEND-TBEG)/(MAXG/TFRQ) IJWALK MAXG+1 Frequency of sampling at higher temperature, that allows sporadically for higher acceptance ratio. The use of this jwalker is not recommended unless the system has critically low acceptance rate. The acceptance rate is printed with energy in place of GRMS printout TJWALK 400.0 Temperature allowing for better acceptance ratio turned on every IJWALK generations. This is turned on only for replicas whose acceptance ratio is less than 1% MAXGeneration 10 The total number of generations (or MC steps) for which the system is to evolve NEVAL 150000 The maximum number of energy evaluations performed on the system IPRINT 5 The number of chromosomes for which the energies are to be printed PRINT_freq 100 Printing frequency of chromosome energies. The chromosome are ranked based on energy, and depending on the print level various terms are included. By default all energy terms are reported. TOLCO 0.0 The value of the RMS deviation between the average chromosomes and all other chromosomes (measured by values of all genes) for which the evolution is terminated due to convergence in variable space. IF the value is zero the convergence is not checked TOLER 0.0 The value of the RMS deviation between the energy of the average chromosome and all other chromosomes for which the evolution due to convergence in energy. If the value is zero the convergence is not checked. RAND off IF this keyword is used the variables in all chromosomes are randomized around their original values with the spread equal to their step sizes ISEED 31415 Seed number for the ranom generator RDIHE 0.0 The spread for generating the initial value of dihedrals if RAND key word is used. With a default value dihedrals will be randomized about their initial values (with RAND present) RROTA 0.0 The spread in initial values of euler angles generated randomly if RAND key word is used. The values are in radians. IF the value is zero euler angles will be randomized about their initial values RTRA 0.0 The maximum distance of the center of mass of molecules from the point specified by RXTR RYTR RZTR for the random initial placement of molecules. RXTR 0.0 The location of the center for generation RYTR 0.0 of initial centers of masses RZTR 0.0 OFTRA 0.2 The minimum distance of the center of mass of molecules from the point specified by RXTR RYTR RZTR for the random initial placement of molecules. TRAN 0.6 The magnitude of the maximum value for the step size of translations, the actual value is calculated : (Random number-0.5)*TRAN ROTA 0.5 The magnitude of the maximum value for the step size of rotations, the actual value is calculated : (Random number-0.5)*ROTA. The values are in radians. BOSTep 0.002 The magnitude of the maximum value for the step size of bond distance change ANSTEP 2.0 The magnitude of the maximum value for the step size of bond angle change in degrees IMPROPER 2.0 The magnitude of the maximum value for the step size of improper dihedral angle change in degrees PINT 0.5 Probability to mutate internal degrees of freedom, the probability to mutate rigid body degrees of freedom is 1-pint. This applies only for the system with active translations or rotations, otherwise the value of PINT is 1. PALL 0.0 Relative probability to mutate all 3 translational or rotational degrees of freedom, default 0 with respect to other rigid body mutations PCALl 0.0 Relative probability to mutate all 6 rigid body degrees of freedom PTALl 0.0 Probability to mutate all internal degrees of freedom simultaneously IBIGstep MAXG+10 The frequency of mutations with big steps BTRAN 0.6 The step for translations every IBIG steps BROTA 0.5 The step for rotations every IBIG steps BDIS 30.0 The step for dihedrals change every IBIG steps QNOE off The flag turning on the NOE potential NBFRQ 1 Nonbonded interaction list update frequency RMIN 0.0 The soft core switching distance, the force for all nonbonded interactions at RMIN*SIGMA is the same as for the distances lower that RMIN*SIGMA. The two above flags turn on the soft core VDW potential. The strongly recomended value for RMIN in the initial stages of docking is 0.885. LEAVE CHROM The number of chromosomes to be left at the end of evolution DELETE off Delete all chromosomes except the first LEAVE CLEAR off CLearing all GA routines Example for evolution of bonds, angles, improper dihedrals and dihedrals. GALGorithm EVOLve - STEA elite 2 EPRE 1.2 - MAXG 1000 - NICHE 1 50 GSIZE 1 MUTA 0.4 CROS 0.6 - RAND ISEED 1423 - PINT 1.0 TOLC 0.01 - PRIN 10 DIST 20.0 IPRO 3.0 ANST 1.5 BOST 0.002 - NBFR 10 DELETE CLEAR
1) Since the setup of GA module is based on the replica module (with deletion of the parent replica) the requirements for running GA code are identical to the requirements for replicas. 2) Energy has to be called before the evolution step. 3) For any interaction of chromosomes with environment PSF (i.e. protein to which chromosomes are docked) it is assumed that chromosomes are in the PSF before the environment 4) Due to the existing bag in the nonobonded list generation after deletion of the parent replica PSF in order to avoid any errors in generation of nonbonded list with environment it is necessary to define PSF for the environment twice and then delete the first environment PSF: replica nrep 10 sele segid ligand end dele atom sele segid ligand end or GALGOR SETUP chrom 10 sele segid ligand end open unit 1 read form name "6Atim.pdb" read sequ pdb unit 1 close unit 1 generate 3pti setup nodihe open unit 1 read form name "6Atim.pdb" read sequ pdb unit 1 close unit 1 generate 3ptb setup nodihe dele atom sele segid 3pti end 5) In the process of generation Cartesian coordinates from internal coordinates it is ASSUMED THAT a dihedral angle IS DEFINED in the IC table for the FIRST THREE atom in the PSF. If the first three atoms from PSF cannot be used as seeds RTF has to be modified so the first three atoms define a dihedral angle in IC table . In a later stage the GA_place routine should be modified to allow for different atom seeding. 6) In order to run consecutive evolutions of GA with different parameters or consecutive runs of MC no DELETE nor CLEAR key words can be used. Otherwise segmentation fault will be reported. 7) IC FILL command has to be used when cartesian coordinated of chromosomes are read from external file if one wants to use initial values of internal coordinates corresponding to the cartesian coordinates. Otherwise internal coordinates will be taken from the parent chromosome. GA_ 8) The supported energy terms are: USER, ANGLE, BOND, UREYB, DIHE, IMDIHE, VDW, ELEC, NOE, CHARM, CDIHE.
Supplementary examples. _________________________________________________________ Alanine dipeptide rigid bond/angle/improper setup: _________________________________________________________ Read sequ card * maa * 3 AMN ALA CBX Generate maa setup ic parameters ic seed 1 cl 1 c 1 o !provide the three atoms to "seed" building ic build !build the structure based upon the internal coordinates (ics) set npar 20 GAlgorithm SETUp - CHROmosomes 20 select segid maa end - VARIable_ICs DIHEdral INCLude 1 C 2 N 2 CA 2 C end DIHEdral INCLude 2 N 2 CA 2 C 3 N end END Nbonds cdie eps 1.0 cutnb 99.0 ctofnb 90.0 wrnmxd 99.0 swit vswit energy ic fill _________________________________________________________ Alanine dipeptide global minimum by generational update GA: _________________________________________________________ GALGorithm EVOLve - RAND iseed @num rdihe 180.0 - steady epres 1.2 - MAXGenerations 1000 - niches 2 inte 10 - gsize 1 muta 0.2 pinte 1.0 cone - print 10 Dist 30.0 anst 1.0 toler 0.01 delete leave 10 clear Alanine dipeptide global minimum by evolutionary strategy: _________________________________________________________ GALGorithm EVOLve - RAND iseed @num rdihe 180.0 - MAXGenerations 1000 - niches 2 inte 10 - gsize 1 muta 0.2 pinte 1.0 cone - print 10 Dist 30.0 anst 1.0 toler 0.01 delete leave 10 clear __________________________________________________________ Alanine dipeptide global minimum by simulated annealing MC: _________________________________________________________ GALGorithm EVOLve - mCarlo Tbeg 500.0 Tend 250.0 tfrq 10 - RAND iseed 3213 rdihe 360.0 - MAXGenerations 5000 - niches 2 inte 10 - gsize 1 muta 0.2 pinte 1.0 cone - ibig 50 bdis 100.0 - print 10 Dist 20.0 anst 1.0 toler 0.01 delete clear _________________________________________________________ GA docking of g3p to tim: _________________________________________________________ setup: set nchrom 150 set nliga @nchrom set npar @nchrom GAlgorithm SETUp - CHROmosomes @nchrom select segid g3p end - VARIable_IC dihe incl 1 O1 1 C2 1 C4 1 O7 end dihe depe 1 O1 1 C2 1 C4 1 C8 end offset -120.0 dihe incl 1 C2 1 C4 1 C8 1 O10 end dihe depe 1 O7 1 C4 1 C8 1 O10 end offset -120.0 dihe incl 1 C4 1 C8 1 O10 1 P14 end dihe incl 1 H1 1 O1 1 C2 1 C4 end dihe incl 1 C2 1 C4 1 O7 1 H2 end dihe depe 1 C8 1 C4 1 O7 1 H2 end offset 120.0 dihe incl 1 C8 1 O10 1 P14 1 O15 end dihe depe 1 C8 1 O10 1 P14 1 O16 end offset 120.0 dihe depe 1 C8 1 O10 1 P14 1 O17 end offset -120.0 rota tran END mult nliga by 2 mult nchrom by 2 open unit 1 read form name "6Atim.pdb" read sequ pdb unit 1 close unit 1 generate 3pti setup nodihe open unit 1 read form name "6Atim.pdb" read sequ pdb unit 1 close unit 1 generate 3ptb setup nodihe dele atom sele segid 3pti end open unit 1 read form name 6tim_m.crd read coor ignore unit 1 append close unit 1 cons fix sele segid 3pt* end fast on wrnlev -1 energy rdie eps 3.0 cutnb 13.5 ctofnb 8.0 ctonnb 6.0 swit vswit inbfr 5 - qrmin rmin 0.885 GALGorithm EVOLve - RAND iseed @num rtran 11. rxtr -0.7 rytr -4.2 rztr -9. oftra 0.3 - rrota 6.28 rdihe 360.0 - steady - elite 2 epres 1.2 - MAXGenerations 400 - niches 5 inte 100 nbfrq 40 pcall .3 pall 0.3 - ibigstep 40 btran 1.8 brota 1.8 bdist 40.0 - qrmin rmin 0.885 qnoe - gsize 1 muta 0.7 cross .3 tran 1. rota 1. pint 0.3 - print 50 Dist 30.0 energy rdie eps 3.0 cutnb 12.5 ctofnb 8.0 ctonnb 6.0 swit vswit inbfr 5 !!!!!!!!!!!!!!!!! !!!!!!! EVOLUTION !!!!!!!!!!!!!!!!! GALGorithm EVOLve - steady - nevalu 5000000 - elite 2 epres 1.1 - MAXGenerations 70 - niches 5 inte 25 nbfrq 50 pcall .2 pall 0.2 - ibigstep 50 btran 1. brota 1. bdist 60.0 qrmin rmin 0.65 qnoe - gsize 1 muta 0.3 cross .7 tran 0.5 rota .4 pint 0.6 - print 50 Dist 30.0 GALGorithm EVOLve - steady - nevalu 5000000 - elite 2 epres 1.1 - MAXGenerations 30 - niches 5 inte 5 nbfrq 30 pcall .2 pall 0.2 - ibigstep 30 btran 1. brota 1. bdist 40.0 qnoe qrmin rmin 0.5 - gsize 1 muta 0.3 cross .7 tran 0.5 rota .4 pint 0.6 - print 50 Dist 30.0 delete clear leave 10
NIH/DCRT/Laboratory for Structural Biology
FDA/CBER/OVRR Biophysics Laboratory
Modified, updated and generalized by C.L. Brooks, III
The Scripps Research Institute