Notes on Proteins

General - Individual proteins: 3CL^pro - Known faults in PM6-D3H4 - Layout of tables - Comparison of geometries - Improving the accuracy of binding energies

General

Proteins in the survey were selected to include a wide range of types. These cover: simple proteins, such as crambin and barnase; metalloproteins, e.g., myoglobin and calmodulin; oligomeric proteins, e.g., hemoglobin and the potassium channel; and polymeric proteins, such as silk and a collagen-like protein. The purpose of the tables of proteins is to show how to specify various types of proteins. The proteins are not intended to be used directly for modeling experiments. No checks have been done to ensure that the starting structures are chemically sensible. In particular, the suggested check, to verify that all charged sites are valid, has not been carried out.

All geometries have been optimized to the standard level of precision; this is normally 0.2 kcal mol^-1 per atom. To see individual gradient norms, open the ARC file.

Individual proteins

3CL^pro (PDB: 4MDS)

PDB file 4MDS is an example of a 3CL^pro enzyme with a ligand bound to the reactive site. Standard preparation of this system consisted of hydrogenation and formation of obvious salt-bridges. When the geometry was optimized using even a small bias of 0.1 kcal mol^-1 Ångstrom^-1 in favor of the PDB structure, a significant distortion at the terminus of the side-chain in Gln-192 was observed (click on "Specific Script" to see this). A common fault in X-ray structures is that the side-chain terminal amide groups in asparagine and glutamine are "flipped," i.e., the O and NH2 are swapped round; see Weichenberger, C. X. and M. J. Sippl (2007), "NQ-Flipper: recognition and correction of erroneous asparagine and glutamine side-chain rotamers in protein structures," Nucleic Acids Research 35(suppl_2): W403-W406. This fault appears to have occurred in 4MDS. When the amide group was swapped around and the geometry re-optimized, again using a small bias of 0.1 kcal mol^-1 Ångstrom^-1 in favor of the PDB structure, the distortion vanished.

Most descriptions of the hydrolysis mechanism of 3CL^pro refer to a catalytic diad, in contrast to the catalytic triad used in hydrolysis in chymotrypsin. But in 4MDS there is a strongly-bound water molecule, H₂O-502, ideally positioned to transfer the electrostatic potential of Asp-187 to the region of one of the conserved residues of the diad, His-41. To see this in the Diad structure, click on "Specific Script". Asp-187 is in such close proximity to Arg-40 that there is no doubt that a salt-bridge exists and that Asp-187 is ionized.

Known faults in PM6-D3H4

These faults are caused by an incorrect parameterization of sulfur in PM6; they are not caused by anything in D3H4.

Severe faults

Two severe faults were found during a survey of unconstrained optimization of proteins. These faults were:

(1) Spurious S - O covalent bond

To see this fault open Cobalamin (2V3N), and display Cys-98 and HOH-2031. In this system, a spurious bond was formed between [CYS]98:A.SG and [HOH]2031:J.O.

(2) Spurious S - N covalent bond

To see this fault open Zinc endoprotease (1C7K), and display Arg-79 and Cys-112. In this system, a spurious bond was formed between [CYS]112:A.SG and [ARG]79:A.NE.
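One way to screen an optimized structure for faults of this kind is to look for S-O or S-N contacts that are non-bonded in the starting geometry but within covalent range after optimization. The following is a minimal sketch, not part of MOPAC: the function name, the tuple layout of the atom list, the coordinates in the example, and the 2.0 Ångstrom cutoff are all assumptions chosen for illustration.

```python
import math

# Assumed cutoff, not taken from the survey: S-O and S-N single bonds are
# typically well under 2.0 Angstroms, so shorter contacts are suspicious.
CUTOFF = 2.0

def find_new_sulfur_bonds(start, optimized, cutoff=CUTOFF):
    """Return (sulfur, partner, distance) for each S...O or S...N pair that is
    at or beyond `cutoff` in `start` but closer than `cutoff` in `optimized`.

    Both arguments are lists of (label, element, (x, y, z)) tuples with the
    atoms in the same order.
    """
    hits = []
    for i, (s_label, s_elem, s_opt) in enumerate(optimized):
        if s_elem != "S":
            continue
        for j, (p_label, p_elem, p_opt) in enumerate(optimized):
            if p_elem not in ("O", "N"):
                continue
            d_opt = math.dist(s_opt, p_opt)
            d_start = math.dist(start[i][2], start[j][2])
            # Only report contacts that appeared during the optimization.
            if d_opt < cutoff <= d_start:
                hits.append((s_label, p_label, round(d_opt, 3)))
    return hits

# Hypothetical coordinates, for illustration only.
start = [("[CYS]112:A.SG", "S", (0.0, 0.0, 0.0)),
         ("[ARG]79:A.NE", "N", (3.65, 0.0, 0.0))]
optimized = [("[CYS]112:A.SG", "S", (0.0, 0.0, 0.0)),
             ("[ARG]79:A.NE", "N", (1.75, 0.0, 0.0))]
hits = find_new_sulfur_bonds(start, optimized)
```

Comparing against the starting geometry avoids flagging genuine S-O or S-N bonds that are already present in the system, for example in a sulfate ion.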

These faults were found in only two of the proteins surveyed, so it's unlikely that the fault will occur in any specific protein. The faults are quite definite in that a normal covalent single bond is formed between a sulfur atom and either an oxygen or nitrogen atom. If a fault of this kind does occur, and if it is far from the site of interest, most likely the fault will not interfere with the work being done. But if the fault is near to the site of interest, or if there is any other reason that the fault should be removed, here is a simple procedure to do that.

Correcting faults of this type

Start with a normal MOPAC data-set for the protein. This should contain the geometry of the original file, not the PM6-D3H4 geometry. A useful first step would be to optimize the positions of the hydrogen atoms, if that has not already been done.

Using a GUI (Jmol is particularly good for this step, see HTML), identify the sulfur atom that is causing the problem, and the oxygen or nitrogen atom involved in forming the S-O or S-N bond. In the case of zinc endoprotease, these two atoms would be "[CYS]112:A.SG" and "[ARG]79:A.NE" (in Jmol format) or "SG CYS A 112" and "NE ARG A 79" (in PDB format).
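The two label formats carry the same four fields, so converting between them is mechanical. The sketch below is a hypothetical helper, not a MOPAC or Jmol function; it assumes single-character chain identifiers and labels shaped exactly like the examples in the text.

```python
import re

# Matches Jmol-style labels such as "[CYS]112:A.SG".
_JMOL_LABEL = re.compile(
    r"^\[(?P<res>\w+)\](?P<num>-?\d+):(?P<chain>\w)\.(?P<atom>\S+)$"
)

def jmol_to_pdb(label):
    """Convert "[CYS]112:A.SG" to "SG CYS A 112"."""
    m = _JMOL_LABEL.match(label)
    if m is None:
        raise ValueError(f"not a recognized Jmol label: {label!r}")
    return f"{m['atom']} {m['res']} {m['chain']} {m['num']}"

def pdb_to_jmol(label):
    """Convert "SG CYS A 112" to "[CYS]112:A.SG"."""
    atom, res, chain, num = label.split()
    return f"[{res}]{num}:{chain}.{atom}"
```

For example, `jmol_to_pdb("[ARG]79:A.NE")` gives `"NE ARG A 79"`, and `pdb_to_jmol` reverses the conversion.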

The next task is to edit the MOPAC data-set to re-define one of the atoms in internal coordinates. This needs to be the atom with the higher atom number. Once that atom is identified, set up internal coordinates using the "distance" as that to the other atom, then the "angle" and "dihedral" in the normal manner. In the case of zinc endoprotease, the atom with the higher atom number would be the sulfur, atom number 1619. This would be at a distance of 3.65 Ångstroms from the nitrogen (see Jmol and ARC), and make an angle of 69.9° with CZ on Arg-79 and a dihedral of -76.8° with CD on Arg-79. Once the atom is defined in internal coordinates, the interatomic distance can be fixed by setting its optimization flag to zero. When all of this is done, the Cartesian coordinates of the sulfur atom in the data-set:

S(ATOM 1619 SG CYS A 112) 24.11700000 +1 22.36800000 +1 8.96800000 +1

can be replaced with internal coordinates:

S(ATOM 1619 SG CYS A 112) 3.650 0 69.9 1 -76.8 1 "[ARG]79:A.NE" "[ARG]79:A.CZ" "[ARG]79:A.CD"

Note that the "bond length" or distance has an optimization flag of "0," and that Jmol labels are used instead of the atom-numbers. Labels in text format are much easier to use because they track the atom name, which is fixed, not the atom number, which can change if atoms are added or deleted.
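The distance, angle, and dihedral can be read from Jmol, but they can also be computed from the Cartesian coordinates. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the sign convention of the dihedral is the common atan2 form and has not been checked against MOPAC's convention, and `internal_line` simply reproduces the layout of the example line above.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def distance(p, q):
    """Distance between points p and q."""
    return math.dist(p, q)

def angle(p, q, r):
    """Angle p-q-r in degrees, with q at the vertex."""
    u, v = _sub(p, q), _sub(r, q)
    cosine = _dot(u, v) / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

def dihedral(p0, p1, p2, p3):
    """Dihedral p0-p1-p2-p3 in degrees (sign convention is an assumption)."""
    b0, b1, b2 = _sub(p0, p1), _sub(p2, p1), _sub(p3, p2)
    norm = math.hypot(*b1)
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))

def internal_line(atom_field, r, theta, phi, refs):
    """Format an internal-coordinate line with the distance frozen (flag 0)."""
    a, b, c = refs
    return f'{atom_field} {r:.3f} 0 {theta:.1f} 1 {phi:.1f} 1 "{a}" "{b}" "{c}"'
```

For the zinc endoprotease case, with `s`, `ne`, `cz`, and `cd` holding the coordinates of the four atoms, the three values would be `distance(s, ne)`, `angle(s, ne, cz)`, and `dihedral(s, ne, cz, cd)`.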

Then optimize the geometry. Do NOT use OPT in the optimization, as that would replace the "bond length" optimization flag with a "1." Instead, do a 0SCF with OPT, then edit the resulting geometry to remove OPT and freeze the bond length once more.

Files that show the error: Zinc endoprotease (1C7K).html, Zinc endoprotease (1C7K).arc, and Zinc endoprotease (1C7K).pdb.

Files that show the correction: Zinc endoprotease (1C7K).mop, Zinc endoprotease (1C7K).out, Zinc endoprotease (1C7K).html, Zinc endoprotease (1C7K).arc, and Zinc endoprotease (1C7K).pdb.

Other faults

PM6 underestimates the carbon-carbon steric repulsion, so whenever two non-covalently bound carbon atoms are near to each other there will be a spurious weak force pulling them together. Although this affects protein structures, the effect is not very large. It mainly affects ligand geometries in solution, where there is nothing to stop the ligand from folding into an unnaturally compact shape. This gives rise to an incorrect heat of formation for ligands, and as a result the calculated binding energies of ligands to proteins (in particular, enzymes) are inaccurate.
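The way an error in the ligand's heat of formation carries through to a binding energy can be shown with simple arithmetic. The sketch below assumes the common definition of binding energy as the heat of formation of the complex minus those of the separated protein and ligand; that definition and all the numbers are assumptions for illustration, not results from the tables.

```python
def binding_energy(hof_complex, hof_protein, hof_ligand):
    """Binding energy in kcal/mol from three heats of formation (assumed definition)."""
    return hof_complex - hof_protein - hof_ligand

# Hypothetical heats of formation, kcal/mol.
reference = binding_energy(-1050.0, -1000.0, -30.0)

# If the free ligand folds into an over-compact shape and its heat of
# formation comes out 3 kcal/mol too low, the binding energy shifts by
# the same 3 kcal/mol.
with_folded_ligand = binding_energy(-1050.0, -1000.0, -33.0)
shift = with_folded_ligand - reference
```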

Layout of the PM7 and PM6-D3H4 Proteins tables

No.: Proteins are listed in alphabetical order. The number only indicates where each protein is in that list.

Protein: All files for each protein are identified by a name. This typically consists of the name of the protein in the PDB and the four-character identifier assigned by the PDB. If the name is also a hyperlink, then the "Specific Script" will be active. This will replace the default image of the protein with a feature of that specific protein.

No. atoms: The total number of atoms in each system. This includes hydrogen atoms, and all small molecules such as water, ligands etc. that are present in each system.

PDB:
ΔH_f: Proteins from the PDB often lack hydrogen atoms. These have been added and their positions optimized; during this process, some salt bridges may form. The heat of formation represents the protein with the positions of the hydrogen atoms optimized. This is the starting geometry.
Jmol: A Jmol picture of the type generated by MOPAC when keyword HTML is used.
ARC: The contents of the ARC file for the protein will be downloaded. This can be copied to a local folder.
PDB: The contents of the PDB file for the protein will be downloaded. This can be copied to a local folder.

10 kcal/Å² constraint:
The starting geometry was allowed to relax slightly, i.e., all geometric variables were optimized, but a restraint imposed a penalty of 10 kcal mol^-1 for every Ångstrom² an atom moved from its original position.
ΔH_f: The heat of formation, in kcal mol^-1, of the system. This does not include the energy from the restraining function.
RMS: The average movement, in Ångstroms, of each atom relative to the starting geometry.
Jmol: A Jmol picture of the type generated by MOPAC when keyword HTML is used.
ARC: The contents of the ARC file for the protein will be downloaded. This can be copied to a local folder.
PDB: The contents of the PDB file for the protein will be downloaded. This can be copied to a local folder.

3 kcal/Å² constraint:
This is similar to the 10 kcal/Å² constraint, but with the strength of the restraining function reduced to 3 kcal mol^-1 Å^-2.
ΔH_f: The heat of formation, in kcal mol^-1, of the system. This does not include the energy from the restraining function.
RMS: The average movement, in Ångstroms, of each atom relative to the starting geometry.
Jmol: A Jmol picture of the type generated by MOPAC when keyword HTML is used.
ARC: The contents of the ARC file for the protein will be downloaded. This can be copied to a local folder.
PDB: The contents of the PDB file for the protein will be downloaded. This can be copied to a local folder.
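The effect of the 10 and 3 kcal/Å² restraints can be illustrated with a short calculation. This sketch takes the wording above literally, i.e., a penalty of k kcal mol^-1 for every Ångstrom² each atom moves; the exact functional form used inside MOPAC (for example, any prefactor) is not stated here, so the function is an assumption for illustration only.

```python
def restraint_penalty(start, current, k=10.0):
    """Total penalty in kcal/mol: k times the squared displacement of each atom.

    `start` and `current` are lists of (x, y, z) tuples in Angstroms, with
    the atoms in the same order.
    """
    total = 0.0
    for a, b in zip(start, current):
        squared_displacement = sum((x - y) ** 2 for x, y in zip(a, b))
        total += k * squared_displacement
    return total

# One atom displaced by 0.5 Angstroms (hypothetical coordinates).
start = [(0.0, 0.0, 0.0)]
moved = [(0.5, 0.0, 0.0)]
strong = restraint_penalty(start, moved, k=10.0)
weak = restraint_penalty(start, moved, k=3.0)
```

Under this assumption, a 0.5 Ångstrom shift costs 2.5 kcal mol^-1 with the stronger restraint and 0.75 kcal mol^-1 with the weaker one, which is why the weaker restraint allows larger movements from the PDB positions.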

Unconstrained:
ΔH_f: The heat of formation, in kcal mol^-1, of the unconstrained system.
RMS: The average movement, in Ångstroms, of each atom relative to the starting geometry.
Jmol: A Jmol picture of the type generated by MOPAC when keyword HTML is used.
ARC: The contents of the ARC file for the protein will be downloaded. This can be copied to a local folder.
PDB: The contents of the PDB file for the protein will be downloaded. This can be copied to a local folder.

Statistics:

At the bottom of the tables there are average Root-Mean-Square errors in Ångstroms for the biased and unbiased optimizations. The reference geometries are the hydrogenated PDB systems.
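For readers who want to reproduce a figure of this kind from two downloaded geometries, the sketch below computes both the root-mean-square and the plain mean displacement per atom. It is an illustration only: it assumes the two geometries list the same atoms in the same order and are already in the same frame (no superposition is done), and it is not claimed to match the exact statistic reported in the tables.

```python
import math

def displacements(reference, optimized):
    """Per-atom displacement in Angstroms between two matching geometries."""
    return [math.dist(a, b) for a, b in zip(reference, optimized)]

def rms_displacement(reference, optimized):
    """Root-mean-square displacement over all atoms."""
    d = displacements(reference, optimized)
    return math.sqrt(sum(x * x for x in d) / len(d))

def mean_displacement(reference, optimized):
    """Arithmetic mean displacement over all atoms."""
    d = displacements(reference, optimized)
    return sum(d) / len(d)

# Two atoms moved by 3 and 4 Angstroms (hypothetical, exaggerated values).
reference = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
optimized = [(3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
rms = rms_displacement(reference, optimized)
mean = mean_displacement(reference, optimized)
```

The RMS value is never smaller than the mean, and the gap between them grows when a few atoms move much further than the rest.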

Comparison of geometries

Comparing an optimized geometry with the equivalent PDB geometry requires three files. The recommended procedure is to have a small data-set that specifies the two geometries that are to be compared. These can be .mop files, but in general .arc files are preferred as they provide more detailed information. The other two files contain the systems to be compared.

This procedure can be illustrated by comparing the PDB structure of 2YPI with the PM7 structure where a bias of 10 kcal mol^-1 Ångstrom^-2 towards the PDB structure is added. The optimized geometry is obviously similar to the PDB structure, but many of the small errors in the PDB are corrected in going to the PM7 biased structure. To see this, download the complete worked example and look at the table "Differences between bond-lengths for the two geometries" in the output file. Select a bond-length that involves two non-hydrogen atoms, and compare the GEO_DAT bond-length (this is the geometry of the optimized PM7 with bias) with the GEO_REF bond-length (this is the PDB geometry). PM7 over-estimates the peptide C-N bond length, predicting it to be 1.37 instead of 1.33 Ångstroms.

The geometric difference between two geometries of any protein in the Protein Tables can be examined by downloading the appropriate ARC files and comparing them using the MOPAC COMPARE option. As soon as each ARC file is downloaded, re-name it so that the new name indicates which calculation was used, e.g., PM6-D3H4 or PM7, original PDB structure or partially or fully optimized.
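The small data-set that names the two geometries can be generated with a few lines of Python. This is a sketch under stated assumptions: the helper and the file names are hypothetical, the keyword line uses only the GEO_DAT, GEO_REF, and COMPARE keywords mentioned above, and the exact keyword syntax and any extra keywords should be checked against the MOPAC manual and the worked example.

```python
def compare_dataset_text(geo_dat, geo_ref, title="Compare two geometries"):
    """Return the text of a three-line MOPAC data-set for a COMPARE run.

    geo_dat: file holding the optimized geometry (e.g. a renamed ARC file).
    geo_ref: file holding the reference geometry (e.g. the PDB structure).
    """
    keywords = f'GEO_DAT="{geo_dat}" GEO_REF="{geo_ref}" COMPARE'
    # Line 1: keywords; line 2: title; line 3: blank comment line.
    return "\n".join([keywords, title, ""]) + "\n"

# Hypothetical file names following the renaming advice above.
text = compare_dataset_text("2YPI_PM7_bias10.arc", "2YPI_PDB.arc")
```

The returned text can be saved under any convenient name, for example with `pathlib.Path("compare_2YPI.mop").write_text(text)`, in the same folder as the two ARC files.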

Improving the accuracy of calculated binding energies of ligands in proteins

Being able to calculate the binding energy of a ligand to a protein is important in drug design, but achieving an accuracy high enough to be useful, i.e., trustworthy, is difficult. None of the heat of formation results for any of the proteins listed in the tables is of high enough accuracy to allow any conclusions regarding binding energies to be made.

Given that high-accuracy binding energies are important, an attempt has been made to develop a procedure to reduce the sources of error. When this procedure is followed carefully, average unsigned errors in binding energies on the order of two to three kcal mol^-1 can be expected.