Accuracy

There are many approximations used by MOPAC that cause errors relative to a hypothetical exact solution of the many-body Schrodinger equation for electrons and nuclei. This page is an overview of these approximations so that users can apply MOPAC appropriately and with reasonable expectations. There is also a list of discussions about MOPAC’s accuracy and reliability that has been preserved from its old website.

In the broader context of atomistic simulation, the semiempirical quantum mechanics (SQM) models used by MOPAC are a middle ground in cost, accuracy, and transferability. Here, “transferability” generically refers to the ability of a model to retain its accuracy beyond its training data. First-principles (i.e. ab initio) quantum mechanics (QM) calculations are typically at least a thousand times more expensive than an equivalent SQM calculation, but they are usually more accurate and more transferable. For example, density functional theory (DFT) calculations with a double-zeta or triple-zeta Gaussian basis set and the B3LYP density functional typically have errors that are two to five times smaller than any SQM model in MOPAC. Molecular mechanics (MM) models based on a force field (i.e. interatomic potential) are typically at least a thousand times less expensive than an equivalent SQM calculation. However, the transferability of force fields is usually quite low. They can be more accurate than an SQM calculation, but that accuracy can rapidly degrade if a force field is used for systems that are not well represented by its training data. For the application of force fields to long molecular dynamics trajectories, it can be challenging to monitor or estimate the overall accuracy of a force field.

For a modern overview on the role of SQM models in atomistic simulation, see this perspective paper. SQM models are good for efficient exploration, either interactively by hand or with high-throughput automation. If your computational budget is large enough, you should use the highest level of QM that you can afford to validate SQM results and improve the accuracy and reliability of predictions. If you plan to explore a restricted family of molecules or materials very extensively, then you should consider using SQM and QM results to validate a force field for more efficient MM simulations or fit a custom force field for your application.

Note that recent developments in machine learning (ML) research have greatly improved the tools for fitting new force fields for MM-based applications. ML research has also generated very extensive sets of atomistic training data such as OMol25 that are being used to improve the transferability of big-data force fields. In principle, SQM model development can also benefit from the data and fitting tools being generated by ML research. However, there has been far less research activity in SQM than MM for many decades now, so the progress of SQM in this direction has been much slower.

MNDO model approximations #

There have been several research lines of SQM model development, and the modified neglect of diatomic overlap (MNDO) model family has been the pinnacle of SQM-based thermochemistry since its development in a paper by Thiel and Dewar in 1977. See this review article for the history and technical details of the MNDO model family and its neglect of diatomic differential overlap (NDDO) approximation. Except for its INDO/S feature, all of the SQM models used by MOPAC are in the MNDO model family.

Atomistic simulation methods with empirical components—force fields, SQM models, and DFT functionals—can all be considered as variations on a basic technical theme. Formally, there exists an exact force field describing the ground-state potential energy surface for any configuration of atoms. This exact force field can be systematically approximated with a model force field by increasing the complexity of its model form while increasing the amount of training data until there is enough data to fit all of the model parameters. However, increasing the complexity of the force field will increase the costs both to train it and to evaluate it. The same type of formal statements and empirical trade-offs exist for SQM models and DFT functionals: a many-electron Hamiltonian can be formally mapped to a Hartree-Fock ground state in a minimal basis using many-body canonical transformations or it can be formally mapped to a non-interacting system of Kohn-Sham electrons using DFT and a density functional. These exact mappings are computationally intractable just like the exact force field, but they can be approximated by SQM models and practical DFT functionals.

An important distinction between SQM models and DFT functionals is that DFT is used to approximate electron correlation effects beyond a mean-field calculation in a large basis set while SQM models use a small basis set and also rely on the model to approximate the complete basis set limit. SQM models are thus used to correct for more sources of error than DFT functionals, but they also tend to have more free parameters. Although available SQM models are not as accurate as popular DFT functionals, that is more a reflection of relative development effort and not a fundamental limit. DFT functional development has been an extremely popular research activity in chemistry, condensed-matter physics, and materials science for several decades, while SQM model development has been kept alive by only a handful of people and research groups since its popularity among method developers faded in the 1970s.

An important additional limitation of MNDO-family models is that they do not directly use diatomic model parameters. Most parameters are specific to individual chemical elements and describe their average behavior in all chemical environments. Interatomic matrix elements between pairs of elements come either from a Slater-type orbital approximation of their valence atomic orbitals, or a Klopman-Ohno approximation of their atomic charge distributions. In either case, a single fixed number—a Slater exponent or a charge radius—is used to characterize the average electronic “size” of each atom. The NDDO approximation also neglects electronic charges from overlapping pairs of atomic orbitals on different atoms. Some of the later MNDO-family models like PM6 and PM7 added a limited set of diatomic interatomic repulsion terms to fix error outliers and also included diatomic parameters indirectly through their use of dispersion models such as Grimme’s D3 model. While the lack of diatomic parameters limits the accuracy and transferability of MNDO-family models, it was an essential limitation that kept the number of model parameters low enough to be fit to the scarce experimental data that was available in the 1980’s.

A nuance of MNDO-family models that complicates their use is that they were designed to approximate heats of formation rather than ground-state total energies. A Hartree-Fock calculation is performed on a MNDO-family many-electron Hamiltonian, and the ground-state total energy of that calculation is assigned to be the heat of formation at a standard temperature (25°C) and pressure. The simple reason for this design decision was that experiments directly measure heats of formation and not total energies. If the model output was a total energy, then additional corrections for atomic vibrations would be needed to approximate the heat of formation. However, the direct modeling of heat is formally awkward because it is a statistical property that only relates to a specific atomic configuration if the thermal distribution over atomic configuration is tightly peaked around a single equilibrium configuration. The heat of formation for a non-equilibrium atomic configuration doesn’t make formal sense without introducing constraints that cause it to be an equilibrium configuration (e.g. in an external potential). Practically, the vibrational corrections that map between such heats and total energies are relatively smooth and featureless contributions to the potential energy surface, and heats and total energies can be approximately interchanged by adding atomic corrections.

Reference data coverage #

The accuracy of MNDO-family SQM models for a specific application depends on both how accurately they fit their training data and how similar the application is to the training data. The training data used by modern MNDO-family models (PM6, PM7, and PM6-ORG) is available in the /data subdirectory of the MOPAC repository and summarized on the website for PM6/PM7 and PM6-ORG.

Because of limitations in the fitting software, MOPAC’s training data contains only molecules and no periodic systems such as polymers or crystals. These molecules are mostly organic, with elements beyond HCNO usually incorporated as ions complexed by organic ligands. As such, some elements are not well described in valences that are common in a materials setting but rare in a molecular setting. The worst-case scenario tends to be elemental metallic crystals, with ground states that collapse into atom-localized orbitals rather than delocalized metallic energy bands. Semiconductors have smaller errors than metals, but their electronic band gaps are systematically too large, which is typical for Hartree-Fock calculations of periodic systems. Molecular solids, covalent networks, and ionic solids are typically as accurate as molecular calculations.

Because MNDO-family models have been primarily fit to heats of formation (in addition to equilibrium geometries, dipole moments, and ionization potentials), they rely on additional model corrections to describe weak, inter-molecular interactions such as dispersion and hydrogen bonding. Thus, in predicting conformational energies or inter-molecular binding energies, they are relying somewhat on model transferability, while other SQM models such as the GFN models in the xTB software are primarily fit to describe these weak, inter-molecular interactions and tend to be more accurate for such applications.

Even with all of this information, it is challenging to predict how well an SQM model will work for a specific application. Conceptually, it is possible to design and fit SQM models with an explicitly characterized success rate for specific simulation tasks, as discussed in this paper. However, this requires SQM model development to use modern, advanced statistical concepts, which is currently beyond its limited resources and technical capacity. Unfortunately, this places more of a burden on users to determine the efficacy of an SQM model for their applications.

COSMO solvent approximations #

MOPAC uses the COSMO implicit solvent model to simulate solvated molecules. For details of the model and its implementation in MOPAC, see the COSMO paper.

The interactions between a solvent and solute can primarily be categorized as electrostatic and statistical. For solute molecules that have a net charge, dipole moment, or highly non-uniform charge distribution, the largest solvation effects are electrostatic in nature. The solvent polarizes in response to the solute’s charge distribution, and this response is approximated by the COSMO model as an ideal conductor located on a solvent accessible surface (SAS) of the solute molecule. The electrostatic energy between the SAS conductor and solute is then modified to approximate the energy associated with a finite dielectric constant. The SAS is formally defined by a union of atom-centered spheres with radii proportional to each atom’s Van der Waals radius, but the SAS is approximated by tessellation to a triangular mesh for numerical work. The tesselation can be adjusted by keywords in the MOPAC input file to test numerical sensitivity.

Besides electrostatic effects, the solute and solvent interact through steric effects and weak inter-molecular interactions that distort each other’s statistical equilibrium. These effects tend to be smaller than electrostatics but are much more difficult to model in detail. Empirical models of these statistical solvent effects usually add a term to the heat of formation that is proportional to the SAS surface area, with the proportionality constant fit to experimental data. MOPAC does not do this for you, but it reports the SAS surface area in the output file, so that you can add such corrections yourself during post-processing.

Numerical approximations #

MOPAC takes a pragmatic approach to numerics that should be suitable for most purposes, although its numerics tend to be a bit looser than first-principles QM software. Numerical errors in SQM and QM calculations are usually much smaller than the errors resulting from model approximations.

Users should be careful when evaluating finite-difference-based derivatives of numerical outputs from MOPAC. The atomic forces evaluated by MOPAC are only partially analytical derivatives and do typically contain terms that are evaluated by finite differences. MOPAC evaluates all second derivatives (vibrational matrix elements) as finite-difference derivatives of the atomic forces. While MOPAC has keywords that can be used to adjust many numerical details, the finite-difference step sizes cannot be adjusted without altering the source code. The default convergence tolerances of the self-consistent field (SCF) cycle were chosen to achieve reasonable overall accuracy in finite-difference derivatives, but that source of error can be further reduced by setting tighter SCF tolerances using keywords in the MOPAC input file.

Calculations of periodic systems have additional sources of error. MOPAC does not use Brillouin zone sampling, so the only way to reduce finite-size effects in periodic calculations is to expand the unit cell into supercells of increasing size. MOPAC also does not have a proper periodic electrostatics solver (e.g. Fourier transforms and Ewald summation) and relies on the Wolf approximation for both periodic systems and MOZYME calculations. The errors of this approximation are usually small in practice, but there is no strict error bound and error outliers might occur for unphysical charge distributions that strongly deviate from local charge neutrality assumptions.