Examples of calculations involving Proteins

Preparing a raw data dataset

One of the best resources for starting protein structures is the Protein Data Bank. The PDB contains structures for a huge number of proteins, and is being continuously updated.  Not all structures are of high accuracy, for example older X-ray structures often have unrealistic interatomic distances, however in such cases the tertiary structures are still good enough for all routine work.  NMR structures should be avoided - whenever NMR geometries are optimized, the RMS difference between the optimized and NMR structure is much larger than the X-ray equivalent.

Choice of PDB structure

Before starting work on proteins, careful thought should be given to deciding what it is that is to be done.  If the objective is to compare the predicted structure with the X-ray structure, then almost any X-ray structure would do.  If the objective is to model a reaction or a docked substrate, then a lot of human effort can be saved by selecting a structure that includes an inhibitor.  First of all, the inhibitor shows where the substrate should be and its approximate orientation.  When the geometry is optimized, the inhibitor can then be converted into the desired substrate, and the geometry re-optimized to give either the reactants, transition state, or products.

If there are several entries in the PDB that meet the desired criteria, then select the one that is most complete. Download the PDB source file, and edit it to make a MOPAC data set. MOPAC data sets can use internal or Cartesian coordinates, or a mixture of the two types.  In protein work, use Cartesian coordinates for all atoms except when a reaction path is involved, in which case a single atom (the moving atom) would normally be in internal coordinates. 


Most PDB entries lack hydrogen atoms, and many have positional or structural disorder.  Fortunately, when the focus of the work is to model the active site, structural disorder is almost always unimportant - the difference between a Leu and an Ile far from the active site never affects the mechanism.  Likewise, positional disorder is also unimportant.  For positional disorder to exist, the energies of the two conformers must be similar, so obviously either one can be chosen. 

Use a third-party program to add hydrogen atoms.  Ideally, that program would add hydrogen atoms correctly, but every program thus far examined has limitations.  Most programs look at a residue, and decide whether to ionize it without consideration of the environment.  So if the sequence -Glu-Glu-Glu- were to occur, all three residues would be ionized; this can result in unrealistic charges.  To see why, consider a dicarboxylic acid, e.g., succinic acid. The first pKa might be low, but the second is inevitably high, indicative of the fact that the di-ion is unstable.  A useful strategy to avoid unrealistic states of ionization is to first neutralize all charges.  This is not as drastic as it sounds - during the geometry optimization, protons can migrate from acid to base residues to form salt bridges.  If a specific proton does not migrate, the data set can be edited so that the proton is moved. After re-optimizing the geometry, the decision can be made regarding the preferred state based on the total heat of formation. 

Once the hydrogen atoms have been added, run a calculation using the keywords  CHARGES and RESIDUES.  In the output, check for the message "Residues where the number of hydrogen atoms is not equal to that expected."  If the message exists, carefully examine the residues indicated.  If any residues are ionized, edit the data set to remove the ionization.  Any residues that are not recognized are the result of either a mistake, in which case correct it, or are a hetero group - a non-standard residue.  If that is the case, use XENO to allow it to be recognized.  It is not necessary that all residues be  recognized in order for the calculation to run, but when they are all properly identified, the probability of errors in the data set is reduced.  Also check any hetero groups, such as heme or retinal, to verify that the hydrogenation is correct.

Preliminary geometry optimization - Hydrogen atoms only

Before optimizing the entire structure, a useful operation is to optimize the positions of the hydrogen atoms.  This is done using NOOPT and OPT-H.  Once the geometry is optimized, check the ionization states.  The "salt rules" should be obeyed: for every ionized residue, there should be a counterion near to it.  Less strict, but still important, every possible hydrogen bond should exist. The one exception to the salt rules is that all Arg should be ionized.  If any Arg are not ionized, ionize them "by hand" and if there are potential counterions nearby, ionize them to balance the charge.  If there are no counterions, put a net charge on the system. When you're satisfied that the geometry is good, save a copy of the data set.  This represents a clean starting point based on the PDB structure. In the following discussion, we'll refer to this structure as PDB.mop

Another useful operation to do at this point is to re-sequence the atoms into the standard PDB format.  This is done with keyword RESEQ.  After re-sequencing all atoms, particularly the hydrogen atoms, in the same residue are together. 

Global optimization

First, if you haven't already done so, save a copy of the structure from the previous step.  Things will go wrong naturally, so you'll want a fall-back position to re-start from when disaster occurs.  For even average sized proteins, global optimization can take CPU days, so anything that can increase the efficiency of this process is worthwhile.  The following sequence has been found to be efficient:

(A) Using keywords GNORM=10 OPT, perform a rough geometry optimization.  During this process, the energy will drop rapidly at the start, then the rate of decrease will taper off. Edit the resulting .arc file to make a new data set.
(B) Using keywords GNORM=5, perform a more precise optimization.  The energy will continue to drop, but not as fast as in the previous step.  Again edit the resulting .arc file to make a new data set.
(C) Run the new data set with 1SCF, and carefully examine the output and .arc file to make sure that everything is correct.  Pay particular attention to the active site and to important hetero groups such as HEME.

Location of Transition States

Location of transition states is a complicated process due to the great size of the systems involved.  As with global optimization, great care must be exercised in ensuring that the calculations are going correctly.  Many steps are involved, and as a result, the whole process requires a lot of human as well as CPU time.  Therefore, before starting on a project to locate a transition state, carefully consider: it the result worth the effort.  Do not start unless you are willing to expend a lot of effort, and be aware that success is not always achieved.  Having said this, if you are still committed to locating transition states in chemical reactions, the main steps are as follows:

These techniques are used in a complete worked example of the location of a transition state in chymotrypsin catalyzed hydrolysis of a peptide bond.

Prepare a data set that represents the reactants and products

Before any work is started on modeling enzyme-catalyzed reactions, the reaction to be investigated should be written out in the normal diagrammatic form.  For example, during the aspartate protease catalyzed hydrolysis of a peptide bond, two or three distinct steps are involved, with the net result that a water molecule adds across the peptide bond to give rise to a carboxylic acid and an amine.  This is, of necessity, a complicated process, so for convenience the mechanism is best expressed as the sum of a few distinct steps.

A good starting point is the Protein Data Bank.  Positioning the reactants and products is difficult, fortunately there is a simple procedure that can be used to make both reactants and products.  Enzymes can be rendered inactive by the addition of an inhibitor.  This is normally a small molecule that has the approximate shape of a transition-state geometry, but is in fact a ground-state system.  Many inhibitors are phosphorus V compounds, which mimic the enhanced valence of carbon in a transition state, for example IVA-VAL-VAL-LEU(P)-(O)PHE-ALA-ALA-OME in PDB entry 1QRP.  1QRP is an aspartate protease inactivated by a synthetic phosphate inhibitor.  At first sight, this system might not appear to be a suitable model for a protease reaction, but by replacing the phosphate group with a peptide linkage and a water molecule, the system becomes more like the reactant ground state of a hydrolysis reaction.  During the early stages of modeling hydrolysis, the nature of the groups attached to the peptide linkage are not critical, so these can be replaced with a small, innocuous, group such as methyl.  The substrate is thus converted from  IVA-VAL-VAL-LEU(P)-(O)PHE-ALA-ALA-OME to N-methyl acetamide plus water.    If the geometry of this system is optimized, the result is the reactant geometry.  By adding the water molecule to the peptide bond, the tetrahedral intermediate can be made, and geometry optimization of this structure gives rise to the product geometry.

Examine the data sets to ensure that they are what you want.

A useful calculation to run at this point is to use the two keywords: 0SCF PDBOUT.

Look at the output, and check that all residues are correctly identified.  Any that are not identified will be indicated by an error message.  If any such messages are found, and the residue is not one of the standard residues, then use keyword XENO to re-define the residue. If the residue is incorrect, correct the fault and re-run.  Although using XENO is not essential, it does reduce the possibility of having an incorrect structure by focusing attention on non-standard residues.

Check the sequence against the original PDB file.  If a residue is different, find out why it is different.  If the residue numbers are different, use keyword START_RES(text) to change the residue numbering to suit the PDB file.  If a chain name is incorrect, use CHAINS(text)  to re-name the chains to match the PDB file.

Look for any messages of the type "Residues where the number of hydrogen atoms is not equal to that expected," if any are found, carefully check the residue.  If any non-standard groups, such as heme, are present, check these using a GUI, as the checks in MOPAC do not work with such groups.

Look at the block headed "Number of charged sites identified."  Versify that the ionized sites are those expected. A useful, but not essential check is to ensure that the "salt rules" are obeyed - every ionized site should have a counterion within a few Ångstroms. 

Another file created during this run is the Protein Data Bank file <name>.pdb.  Open this file using a GUI. and compare the structure with that downloaded from the PDB.  If all the hydrogen atoms are at the end of the Z-matrix, they can be put in their correct position using RESEQ.  Look at the active site, and check that the environment is that expected.  Pay careful attention to hydrogen bonds in the active site - if one is wrongly positioned, then when the reaction occurs the energy might suddenly drop by several kcal/mol.  This can be very disconcerting!

Using the GEO_REF option, move the reactant geometry in the direction of the product geometry, and vice versa. 

Three special points on the potential energy surface are needed for a normal reaction: the reactant geometry, the product geometry and the transition state geometry.  These are, in order of increasing energy: product < reactant < transition state.  If these conditions exist, then moving the product geometry towards the reactant geometry will also move it towards the transition state.  Likewise moving the reactant geometry towards to product geometry will move it towards the transition state.  This process has to be iterative - in the first step, both reactant and product are moved on to the slope of the reaction barrier, then they are moved up the slope, then further up the slope.

Before starting, make sure that there is a one-to-one correspondence of all atoms in the reactant structure with those in the product structure.  If any hetero groups are going to be involved in the reaction, they must be present.  Small differences in the atom sequence are acceptable.  For example, the PDB convention does not define which carbon in Phe is Δ1 and which is Δ2, to allow for this necessary ambiguity the order of occurrence of atoms can be changed at run-time, provided the changes are small.

Before proceeding, read the notes on biasing a geometry towards another structure in geometry constraints. Assume that the reactant data set's name is reactant.mop, and the product data set name is product.mop

The first step in locating the transition state involves moving the product (the lower energy structure) towards the reactant.  Use the default GEO_REF keyword  for this. Edit the file product.mop to add the keyword GEO_REF="<path-name>/reactant.mop" and run that calculation.
Next, move the reactant towards the product.  Edit reactant.mop to add the keyword GEO_REF="<path-name>/product.arc", and run that job.  Note that MOPAC data sets can be either normal .mop files or MOPAC archive files.
Edit the files reactant.arc and product.arc to make new data sets.  Call these reactant_3.mop and product_3.mop. 
Almost certainly, product_3.mop will have the lower energy - check the newly-created ARC files.  Edit it to add the keyword GEO_REF="<path-name>/reactant_3.mop"10, and run that file.  Edit the file reactant_3.mop to add the keyword GEO_REF="<path-name>/product_3.arc"10,  and run that.

Edit the resulting ARC files to make new datasets, to be called reactant_10.mop and product_10.mop. Whichever one has the lower energy, use that first in the next step.  If it's product_10.mop, then use keyword  GEO_REF="<path-name>/reactant_10.mop"20, if it's reactant_10.mop, then use  GEO_REF="<path-name>/product_10.mop"20.  When that calculation is done, use the resulting ARC file in the next calculation..

At each stage, look at the output and check that the "distance" between the reactant and product geometries is decreasing.  That is, that the structures are simultaneously moving towards the transition state.

When the distance between the reactant and product geometries is small enough, the two geometries are averaged.

GEO_REF should not be used to locate the transition state - as the distance from the reactant or product to the transition state becomes small, GEO_REF becomes too slow.  Once the distance is small, less than about 10 Å, set up a normal GEO_REF calculation, but this time add the keyword TS.  That will average the two geometries and give a good approximation to the transition state.

Optimize the geometry of the rest of the system, while ignoring the transition state, and refine the geometry of the transition state, while temporarily ignoring the rest of the system.

If everything has gone according to plan, the geometry from the last step will be a good approximation to the transition state.  Before proceeding further, check that the structure is indeed what was expected.  Once satisfied that the structure is correct proceed as follows:

Using a GUI, identify the atoms involved in the reaction.  This should be the three or four atoms that are involved in bond-making, bond-breaking (call these the central atoms), plus all the atoms they are bonded to, plus all the atoms bonded to the atoms bonded to the central atoms.  This should amount to between 6 and 12 atoms.
Edit the data-set to set the optimization flags for these atoms to "0", that is, mark them as "not to be optimized."

Refining the transition state can be done in a single, long, calculation that involves repeated pairs of calculations.  The first of each pair consists of an energy minimization while freezing the transition state.  The second of each pair consists of gradient minimization of the transition state, while freezing all atoms not involved in the transition state.  As this is an unusually complicated calculation, it is helpful to see the form of the data set:



To explain: The first set of keywords defines a geometry optimization.  Because the active site is frozen, the optimization affects all the rest of the system.  The next set of keywords, "MOZYME INVERT TS OLDGEO" sets up a gradient minimization, TS, that uses the old geometry, OLDGEO, but reverses all the optimization flags, INVERT. This is the first of a pair, the second of the pair is the next calculation " MOZYME INVERT OLDGEO", and consists of an energy minimization while freezing the transition state.  This process can be repeated as often as desired, but three or four iterations are normally sufficient to reduce the gradient to about 3 or 4 kcal/mol/Ångstroms, sufficient for a transition state location operation. The final calculation should be a gradient minimization, this is because its results will be used in making a FORCETS job.

Carry out a FORCETS calculation to verify that the system is in fact a transition state.

All that remains is to characterize the transition state.  Before starting, check that the optimization flags for the transition state are all "1" and for the rest of the protein are all "0".  If the flags are reversed, then add keyword INVERT.

Run a transition state calculation using FORCETS.  This is like FORCE, but is restricted to those atoms which are flagged for optimization.  Verify that there is exactly one imaginary vibration.

Generate the full reaction path using the IRC.

Finally, run an IRC calculation.  In addition to verifying that the transition state is the correct one, it also provides information on the nearest reaction and product structures.  Because of the cost involved, use IRC=1 and IRC=-1, not IRC=1*  If the computer resources are available, run both simultaneously.