Improving the accuracy of calculation of the relative binding-energy of a ligand
non-covalently attached to a protein (Back to "Proteins")

The Problem

Computing the relative binding energy of ligands non-covalently bound to proteins requires achieving an unusually high accuracy. This requirement is essential when calculating the binding ability of various ligands, in particular candidate therapeutic drugs. If the modifications described below are made then an estimate of the accuracy in computed binding energies that can be expected is as follows:

Errors in the interaction energies between the ligand and the protein. Over a range of types of interaction energies the Average Unsigned Error relative to very high-quality calculations is 1.72 kcal mol^-1 for PM6-D3H4 and 2.91 kcal mol^-1 for PM7. If the set S12L, a severe test of non-covalent intereactions, is removed, the AUE's drop to 1.04 kcal mol^-1 for PM6-D3H4 and to 1.34 kcal mol^-1 for PM7.
Incomplete geometry optimization would cause errors of about 1 kcal mol^-1.
Errors in the ΔH_f of the ligand and protein separated in solution might not cancel out.

Using these data an estimate of the average unsigned error in the relative binding energy of the ligand non-covalently bound to a protein is about 2 to 3 kcal mol^-1, provided, and this is a big qualification, the errors in 3 are small. Errors in the ΔH_f of the protein obviously cancel out, but errors in the ΔH_f of the ligand in solution are more problematic. Specifically, the known error in PM6 that under-estimates steric repulsion between non-covalently bound atoms can compromise the ΔH_f of the ligand in solution.

The problems with conventional calculation options are as follows:

The systems are large and multiple minima exist

If an entire protein is used then almost always a severe problem is encountered in the geometry optimization that prevents the necessary high accuracy in energies from being achieved. This problem is a direct consequence of the very large number of local minima on the potential energy surface (PES) and is common to all complicated systems such as proteins; it has nothing to do with the optimization procedure or with the underlying method. When a geometry is optimized the procedure used moves the atomic coordinates in a downhill direction on the PES. This process terminates when no further decrease in heat of formation (ΔH_f) can be achieved. Minor changes in the starting geometry can give rise to a different path being followed, resulting in a different minima being reached. These minima can differ in ΔH_f by several kcal mol^-1, and, although these minima might look correct, any conclusions drawn from the various ΔH_f would almost certainly be untrustworthy.

Both PM6 and PM7 methods and their modifications have errors

PM6 should never be used on its own because, it severely underestimates dispersion effects and hydrogen bonds and as a result it is essentially useless for predicting non-covalent interactions, Instead always use PM6-D3H4 or PM6-D3H4X. Having said that, the underlying PM6 method has two known errors, involving spurious sulfur - oxygen and sulfur - nitrogen interactions, see details.

PM7 has one severe error, a spurious chlorine - nitrogen non-covalent interaction. This interaction results in an excess stabilization of about 11 kcal mol^-1 in chlorine-containing ligands non-covalently bound to proteins.

Incomplete geometry optimization results in large errors

Although the default geometry optimizer for large systems, the L-BFGS optimizer, is the most efficient optimizer in MOPAC, at the end of an optimization this procedure has difficulty locating the energy minimum, and it often terminates several kcal mol^-1 above the lowest point.

Suggested modifications

Multiple minima

This procedure is only recommended for proteins that do not exhibit allosteric behavior.

Two steps are recommended. In the first step protein-ligand docking software, e.g., GOLD, should be used to generate a number of candidate conformations of a ligand docked into a protein. In the second step all of the protein far from the ligand is deleted from the system being modeled. This reduces the size of the system to only the region of interest, this region will be referred to as a Globule. When a Globule is used most of the minima are automatically excluded.

Errors in methods

PM6-D3H4 errors: Unless any sulfur atoms are near to the site of interest and when the globule geometry is optimized a severe distortion occurs, this error is unimportant and can be ignored. If corrective action is needed, then the appropriate interatomic separation should be fixed, i.e., not allowed to optimize. Alternatively, and more easily implemented, use PM7 instead of PM6-D3H4.

PM7 error: If any chlorine atoms are involved in the ligand - protein interaction, don't use PM7; use PM6-D3H4 instead.

Incomplete geometry optimization

This fault can be completely eliminated by using LET(nnn), where nnn is large. A good value for nnn is 250. This significantly increases the time required for the optimization, but the results are considerably more reliable.

Surveys

Working with ligands involves large numbers of jobs, typically the number of ligands times the number of poses from the docking program, so to minimize the possibility of making a typographic error as many steps in preparing and running a job as possible should be automated.

Suggestions for running surveys that use a Globule:

Use the SETUP keyword to supply all the keywords. If a change needs to be made to the keywords only one file, the setup file, need be edited.
Usually many different ligands and poses are used, so that a large number of jobs need to be run, in which case additional useful keywords are THREADS=1, DUMP=2D. These keywords improve performance when many jobs are run simultaneously in a batch queue. THREADS=1 turns off multi-threading inside MOPAC. When the number of jobs that can be run simultaneously equals the number of threads, each job uses one thread, so the system overhead is almost zero. In the unlikely event of a computer crashing, it is possible to restart individual jobs that were killed by the crash if a RESTART file exists. But doing this involves some typing, an error-prone operation, a simpler option is to ignore the RESTART files and just re-run any jobs that were killed. To stop RESTART files from being written, add DUMP=2D. This effectively turns off the default two-hour dump.
Use macros to construct the jobs and to run them. There is a very high probability of making errors if macros are not used. When writing macros, carefully label the various jobs, both in the job-name and in the job-title. For example, if there are 50 ligands, labeled ligand-1 to ligand-50, and 10 poses for each ligand, then job-names such as L1-5 might be used. This would indicate that the system is Ligand-1 in Pose 5. Try to use short job-names, and use the folder name to specify the survey
Run the survey.
Again, use macros to analyze the results. These macros should search the various ARC files to find the heat of formation and, optionally, the gradient norm.
If there is any doubt about the system being at a minimum, re-run all the systems but start with the protein geometry from the system that is most stable, i.e., has the largest stabilization energy. If the stabilization energy increases, then replace the old systems with the new ones. If thought necessary, this step could be repeated until the stabilization energy is a maximum for each system.
When all systems are at their most stable, a useful operation is to COMPARE the different systems to see what the differences are, for example large differences in protein geometry would be expected near the ligand, and small differences would be expected near the surface of the Globule.