Conventional semiempirical methods use matrix algebra methods, most of which scale as the third power of the number of atoms. A consequence of this is that modeling of systems of more than 1,000 atoms is impractical.
By using Localized Molecular Orbitals (LMO), the Self-Consistent Field equations can be solved in a time proportional to the size of the system. Increasing the speed of the SCF moved the slow step to other parts of the calculation, so changes were made to all the remaining time-consuming steps to make them more efficient. Changes were also made that reduced the memory demand. The result of all these changes is that routine modeling operations can now be carried out on over 90% of all the entries in the Protein Data Bank (PDB), previously only about 10% of all entries could be used.
Geometric functions
Most of the common modeling functions have been enhanced by the MOZYME function, these include:
Single-point calculations.
A general impression of the change in behavior can be obtained by comparing the resources required for various single-point, i.e., single SCF, calculations. Other operations, such as geometry optimization, consist of repeated SCF calculations. All calculations shown here were done using a Windows XP Pro computer with a 4000+ or 2.41 GHz Athlon processor.
The test systems used here were constructed from a single large protein, cut into pieces of the appropriate size.
Comparison of MOPAC and MOZYME computer resources required for a 1SCF calculation
No. of atoms
Time for 1SCF (minutes)
Memory (megabytes)
Ratio MOPAC/MOZYME
MOZYME
MOPAC
MOZYME
MOPAC
time
memory
400
0.2
2.3
17
101
12
6
800
0.4
25.4
33
391
64
12
1000
0.9
59.0
43
607
65
14
1500
2.3
222.6
78
1,424
97
18
2000
2.9
(527)¹
102
(2,532)
(182)
(25)
3000
5.6
(1,781)
164
(5,696)
(318)
(35)
5000
15.8
(8,244)
367
(15,822)
(522)
(43)
10000
56.9
(65,956)
1,007
(63,288)
(1,159)
(63)
15000
230.3
(222,600)
1,026
(142,400)
(966.4)
(139)
18000
308.4
(384,653)
1,262
(205,056)
(1,247)
(162)
1: Numbers in parentheses are estimated
After the first SCF calculation is done, all subsequent SCF calculations run much faster:
Geometry optimization:
This includes single point, restricted, and unrestricted, for molecular assemblies and for solids.
Transition state location.
This operation runs at about half the speed of geometry optimization.
Transition state refinement.
This operation runs at about 20% of the speed of geometry optimization.
Transition state characterization.
The
FORCE calculation is a rapid operation now, running in a
small fraction of the time required for geometry optimization.
Intrinsic Reaction Coordinate.
Unlike the other operations, this is very time consuming,
taking 10 50 times longer than a geometry optimization.
Reaction paths, grids, etc.
These can be run using only the atoms of interest, in which case
it is very fast, or using all atoms at every step, in which case it is slow. The time for each point is
only slightly less than a geometry optimization.
Accuracy
Properties of proteins (structure, energetics, reactions, etc.) are reproduced with good accuracy using PM7. For details of PM7, see:
http://link.springer.com/article/10.1007%2Fs00894-012-1667-x
For accuracy of PM7 compared to other methods, see:
For accuracy of modeling protein properties, see: HTTP<add URL here>
Limitations
The current MOZYME function is restricted to closed-shell systems
for which a Lewis structure can be generated. Only Restricted Hartree Fock methods can be used, ROHF and
UHF are not available. This means that well-known radicals, such as ethyl, C2H5·, and
triphenylmethyl, (C6H5)3C·, cannot be modeled. An exception to this rule
concerns biomolecules that contain transition metals. These are normally treated as open-shell systems, but
if the main focus of interest is on geometries and energetics, they can be modeled using closed shell methods.
Electronic properties:
Only the ground electronic state can be modeled. This excludes all excited electronic states.
Structures:
Thus far, no structures have been found that cannot be modeled using MOZYME.
The maximum number of atoms attached to a given atom is 15.
Upper size limit for single point calculation (usually for generating forces):
18,000 atoms.
Upper size limit for transition state operations, the IRC and the DRC:
The same as geometry optimization.
Utilities
Detection of errors in the starting geometry. Of their nature, proteins are large molecules,
and making sure that every atom has the correct valency is tedious. Every program for adding hydrogen
atoms that was tested has faults, and even one fault can ruin a simulation. All faults resulting in incorrect
valences can be identified by keyword CHARGES. This prints out the atom numbers and types of all charged atoms.
These can then be quickly checked using a GUI, and any corrective action then taken.
Identification of residues.
Using RESIDUES, the residue sequence can be calculated and printed. This allows unrecognized residues
to be identified. If they represent user errors, then corrective action can be taken. If they represent genuine hetero groups,
they can be re-defined as modified residues using keyword
XENO. If they are the result of severe distortions generating spurious bonds,
these bonds can be broken using keyword CVB.
Identification of ionized residues.
During a geometry optimization, hydrogen atoms might migrate to form a salt bridge.
All resulting ionized residues are indicated by a trailing "+" or "" sign. Partial charges on residues can also be printed.
Ramachandran angles.
All Ramachandran angles are printed when
both
RAMA is present and residues are identified.
Atom and residue gradients.
Total forces acting on atoms and on residues can be printed. This allows the
rapid identification of geometric distortions, particularly in X-ray structures.
Restart option.
Because calculations involving proteins often take many days, a robust restart
function is provided. During a long run, the restart file(s) can be used to run a zero-SCF, this allows a detailed
examination of the state of the running system to be made, without stopping the run.
Read PDB files. Protein data bank files are automatically recognized and read in. Most of the PDB
format is supported.
Write PDB files. When PDBOUT or RESEQ is used, the geometry is printed out in PDB format.
Re-define chains and residue numbers
Resequence. When hydrogen atoms are added, they are usually put at the end of the data set.
This results in a non-standard PDB format. The standard PDB format can be generated using RESEQ. This puts all atoms in the same residue together.
If two geometries, from entirely different PDB files are compared, the atoms in each file must first be put into the same sequence,
so that there is an exact one-to-one correspondence. Unavoidable ambiguities in the PDB format definition require some extra re-arranging
of atoms to ensure maximum coincidence. This is done using keyword GEO_REF.
Getting started using MOZYME
Because the issues involved in manipulating proteins are very different from those involving small molecules, all users are strongly recommended to spend some time familiarizing themselves with the
MOZYME utilities.
General guidelines
The initial geometry should be uncharged. In many proteins there are charged sites such as carboxylate anions and ammonium cationic sites; often these form internal salt bridges. Nevertheless, in order to avoid any ambiguity about the electronic state of the system, all charges should be neutralized during the preparation of a starting geometry
by adding or removing hydrogen atoms
. Even systems, such as bacteriorhodopsin,
that have a definite net charge, should be first represented as the neutral species. The only exceptions are when a strong monovalent
ion such as potassium or tetra-alkylammonium cation, or a halide anion is present, in which case the ion should be correctly represented.
Preconceived ideas of charge are often wildly inaccurate. These include generalizations such as "all acid functions should be ionized" and "all arginine residues should be protonated." If these ideas were implemented, the resulting net charge on a protein would likely be nonsense.
This policy of requiring the initial structure to be uncharged might appear to be restrictive, but during a geometry optimization salt bridges will often form spontaneously, so the initial uncharged species is quickly replaced by a more realistic system. After the geometry is optimized, any ions the user considers important can then be added. If the entire protein has to have a certain net charge, protons can be added or removed as necessary. Thus bacteriorhodopsin would have the retinal Schiff base nitrogen atom protonated to give bR+.
The end result is a well-defined protein with the correct charged sites.
Check hetero groups. Most utilities that add hydrogen atoms do a good job of handling the common residues, but a poorer job with hetero systems, such as the heme group. Because of this, almost all the errors in adding hydrogen atoms occur with hetero groups. Therefore, these groups should be checked carefully to verify that the degree of saturation is correct.
Identify and characterize uncommon residues. If a residue is not recognized by MOPAC, check that it is correct.
If it should be one of the 20 common residues, either add the missing atoms, or use
XENO to allow the residue to be recognized.
If it is a modified residue, such as the retinal group in bR, use XENO to re-define it as the parent residue
Do not ignore unrecognized residues they usually indicate a fault in the data set.
Trust that the charged sites printed are correct. The method for calculating charged sites is very robust.
If it says that a site is charged, believe it. In older versions of MOPAC, charged sites were not printed,
and about twenty users reported errors in MOPAC, saying that the net charge it wanted was incorrect.
In every case, errors were found in the users data set, due to simple errors in assigning hydrogen atoms.
Although the charges reported by MOPAC are always correct, minor problems might be encountered in extended, delocalized,
p systems. Thus if the hydrogen on the phenol oxygen in tyrosine is missing,
the charge might be placed on the Cg rather than on the oxygen,
that is, the Lewis structure MOPAC generated was for the quinone rather than the phenolic resonance structure.