Adding hydrogen atoms
- Finding and
correcting faults in the PDB file - Resequencing
- Comparing two geometries
Freezing an interatomic distance
- Atom numbers and labels
Most PDB files do not include hydrogen atoms, so these need to be added before any serious calculations are done. The simplest way to add hydrogen atoms is to run the un-modified PDB file using MOPAC. This will produce a fully-hydrogenated system, with all ionizable sites neutralized. As this is normally not suitable for modeling because some sites should be ionized, an alternative and better hydrogenation strategy is to make a MOPAC data-set that uses the PDB file.
To hydrogenate a PDB file, create a MOPAC data-set that has an informative name, for example, if the PDB file is 5A5A.pdb, then name the MOPAC data-set "5A5A Add-H.mop" This file should contain the following lines:
GEO_DAT="5A5A.pdb" ADD-H SITE=(SALT) HTML OUTPUT NOCOMMENTS Adding hydrogen atoms to 5A5A
By using GEO_DAT the PDB file can be
left unmodified. Having the PDB file in a different folder reduces the
possibility of altering it. If that is done,
edit the GEO_DAT keyword to suit, for example GEO_DAT="../../original
PDB files/5A5A.pdb".
ADD-H is the command to hydrogenate the
entire system and neutralize all ionizable sites.
SITE=(SALT) This will
form salt-bridges, ionized pairs of amino-acid
residues that are near each other.
HTML At the start of a
project, a simple and rapid way is needed for checking various parts of the
protein system. The fastest way, see HTML, is to use Firefox and the JSmol program.
OUTPUT At this stage the focus is on hydrogenation and on any problems
associated with the original PDB file, so suppressing the geometry in the output
file allows details of the hydrogenation to be seen more easily.
NOCOMMENTS This will suppress the PDB comments, remarks, etc.
Run the
hydrogenation job, e.g., MOPAC2016.exe "5A5A Add-H.mop"
Look at the
output and at the JSmol HTML file. In the output, look to see if there are
any messages that might cause concern, such as "Residues where the number of
hydrogen atoms is not equal to that expected" If any such messages appear,
carefully look at the site(s) mentioned. If the cause of the report is
found, take the appropriate action. Most often the cause will be a severe
error in the original PDB file.
Check the hydrogenation of hetero groups, particularly porphyrin and heme heterocycles, and systems where there are more than one valid Lewis structure, such as guanine. If the system has known charged sites or if there are sites that should be ionized, and they're currently not charged, or if a hydrogen atom is missing that should be present, or a hydrogen atom is present and it's not wanted, then edit the hydrogenation data-set, e.g., edit "5A5A Add-H.mop" to correct the error, then save the edited file and re-run the hydrogenation. If other errors are found, make the appropriate changes to the hydrogenation data set and re-run it. By editing the hydrogenation data-set, a complete record of all the corrections that were made would be generated.
The next step involves transitioning from the simple hydrogenation to a more realistic and useful data-set. To do this, edit the ARC file to make a new data-set. Making sure that the various sites in the protein are correctly ionized is essential, so add CHARGES, this will print out all the charged sites found in the system along with the net charge on the protein. If there there are faults in the charges, correct them by using SITE to add or delete hydrogen atoms as appropriate in the previous step, and in complicated cases, particularly heme, by using CVB and SETPI to ensure that the Lewis structure is correct.
All that remains is to optimize the positions of the hydrogen atoms. To do this, copy the ARC file for the hydrogenated system and name the copy, e.g., "5A5A Opt-H.mop". Edit this file to make a data set. Use keywords:
CHARGE=n HTML OUTPUT EPS=78.4 NOOPT OPT-H MOZYME GNORM=5
Set the charge to suit the system. If the supplied
and calculated charges are different, examine the list of calculated ionized
sites to find what caused the difference, then take the appropriate action.
Use EPS=78.4 in all protein work, it mimics implicit water as the solvent.
NOOPT and OPT-H
together specify that only the positions of the hydrogen atoms
are to be optimized.
MOZYME - essential.
GNORM=5 (or for large systems,
over ~5000 atoms, GNORM=10) High precision of geometry optimization is not
essential here.
Run this job. It will take from an hour to about two days, depending on the size of the system. A useful user-pacifier at this point is to use the Linux "tail" utility, as in:
tail -100f "5A5A Opt-H.out"
When the job finishes, have a look at the results. Most likely there will not be anything very interesting. If there are error message, examine them and take the appropriate action.
If there is a reason to suspect that there are errors in the PDB file a test to identify geometric errors can be run. But don't do this test unless there is a good reason. This requires running a single SCF calculation and calculating the gradients (forces) acting on all the non-hydrogen atoms. For this, start with the ARC file from the hydrogenation operation. Use the ARC file in which the positions of the hydrogen atoms have been optimized. Edit this file to make a data-set named, e.g., "5A5A Opt-H 1SCF.mop" Use keywords:
CHARGE=n HTML OUTPUT EPS=78.4 INVERT MOZYME GRADIENTS 1SCF
INVERT reverses all the optimization flags, so the hydrogen atoms optimization flags are set "off" and the optimization flags for all other atoms are set "on."
Run this job, and look at the section in the output headed "LARGEST ATOMIC GRADIENTS" Large gradients indicate a large difference between the calculated local environment of an atom and the local environment in the PDB file. Most PDB files do not distinguish the C-OH and C=O bond-lengths in a carboxylate, -COOH, group, but semiempirical methods predict a significant difference, typically the C-OH bond-length is ~1.31 Ångstroms and the C=O bond-length is 1.23 Ångstroms. Because of this, gradients on carboxylate carbon and oxygen atoms are often in the range 100 - 200 kcal.mol-1.Å-1. These can be ignored, unless they are over 200 kcal·mol-1.Å:-1.
Bond-lengths for sulfate and phosphate are not well reproduced, errors of up to 0.1 Å are not uncommon.
Look for any remaining large gradients, those in the range 150 - 200 kcal·mol-1.Å-1. If any exist, look at their local environment using JSmol. In particular, if the atom is in a heterogroup or in a modified residue, check the hydrogenation carefully. Most hydrogenation errors can be found by this test.
In some proteins fragments of residues might be missing, for example a threonine residue might have the methyl group missing. Residues of this type can be detected by using keyword RESIDUES. Again, only correct errors of this type if they are considered to be important.
Resequencing puts all atoms into the standard PDB order, i.e., within each residue the first atom is the nitrogen, this is followed by the rest of the non-hydrogen atoms, then the hydrogen atoms.
When hydrogen atoms are added or deleted using the SITE keyword, resequencing is done automatically. To prevent this happening, add NORESEQ.
Two protein systems can be compared to see how similar they are. Examples of pairs of systems that can be compared are:
Although the COMPARE operation does not need Firefox and JSmol, using them to examine details of the comparison is extremely helpful.
When two systems have different compositions, atoms that present in one structure but not in the other will be deleted. Atoms that are present in both structures, but have slightly different PDB labels, for example a Cδ1 in one structure might be a Cδ2 in the other, will be used in the comparison. Only the labels in the data structure will be changed; the labels in the reference structure will not be changed.
Proteins are normally defined using Cartesian coordinates. Internal coordinates are essential if an interatomic distance, angle, or dihedral is to be frozen. To do this, mixed coordinates are needed. All atoms, except the atom to be frozen, should be defined in Cartesian coordinates. The atom to be frozen should be defined in internal coordinates, then the optimization flag for the specific coordinate set to zero.
If the connectivity is entirely numeric, then the frozen atom can
have a connectivity that refers to one, two, or three atoms that have
not yet been defined.
If the connectivity uses atom labels,
the atoms used in the connectivity must already be defined.
Atoms can be defined by their atom numbers. This option is not
recommended, because the atom numbers might change if atoms are added or deleted. Instead, atom labels are recommended. Two
types - PDB labels and JSmol labels - are supported.
If an
atom has the PDB label "ATOM 1112 CZ ARG A 423", then the atom label
would be "CZ ARG A 423".
If an atom has the JSmol label
"[LEU]149:A.CD2", then the atom label would be "[LEU]149:A.CD2".
In
the limit, a connectivity for an atom could use all three types of
label: number, PDB, and JSmol.