Protein Structure Comparison

What it is

Two protein structures can be compared to show their similarities and differences.  The second of the two proteins is rotated and translated so as to minimize the root-mean-square (RMS) difference between it and the first geometry.  If swapping pairs of atoms would reduce the RMS difference, this is done. Differences are represented in three ways. The simplest is a list of the atoms that have the largest difference in position; this is of limited use because some parts of a protein are very flexible, i.e. large geometric changes might be accompanied by only very small changes in energy.  A more important type of difference involves changes in bond lengths.  Because covalent bonds have high force constants, any significant change in bond length indicates a significant change in the local energy environment.  The third measure of difference is the change in hydrogen-bond energies.  An individual protein often contains hundreds of hydrogen bonds, and the formation or loss of even a single one of these can change the heat of formation by several kcal/mol, so information about the creation or loss of a hydrogen bond can focus attention on possible problems in a structure.
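As a rough illustration of the first measure, the sketch below computes the RMS difference between two coordinate sets and ranks the atoms by displacement.  It assumes that the two sets have already been superimposed and that atoms are matched one-to-one in the same order; the function names and the input layout are illustrative only and are not part of MOPAC.

```python
import math


def rms_difference(coords_a, coords_b):
    """RMS distance between two equal-length lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


def largest_displacements(labels, coords_a, coords_b, top=3):
    """Return up to `top` (label, distance) pairs, largest displacement first."""
    moved = [
        (label, math.dist(a, b))
        for label, a, b in zip(labels, coords_a, coords_b)
    ]
    return sorted(moved, key=lambda item: item[1], reverse=True)[:top]
```

A large entry in this list is not necessarily a problem: as noted above, flexible regions can move a long way at little energetic cost.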

What it can do

Given two PDB files that represent the same protein, or two similar proteins, a structure comparison starts by identifying those atoms that are common to both systems.  Atoms that are present in only one of the two systems are identified and excluded.  The remaining atoms are superimposed so as to minimize the RMS difference, and lists of the three types of differences are printed.  A simple web-page is then written that allows the two superimposed systems to be visualized using JSmol.  With the three lists of differences and the JSmol graphics, the nature of the differences between the two systems can be quickly understood.

How to do it

The recommended method for comparing protein structures is to use a two-line data-set.  As with most MOPAC data-sets, the first line consists of keywords; the second line, although optional, is strongly recommended, and should be a description of the two structures being compared.

For the most concise and informative results, use the following set of keywords on the first line: "GEO_DAT="text" GEO_REF="text" 0SCF HTML GEO-OK OUTPUT".  Their significance in this procedure is:

GEO_DAT="text"

The first of the two geometries is specified.  This geometry can be a MOPAC data set, an ARC file, or a PDB file; the PDB file can be straight from the Protein Data Bank, an edited PDB file, or a PDB file made by MOPAC.  This keyword is not essential, in that the first geometry could be placed after the normal first three lines.  However, using the keyword allows an unedited protein geometry to be read in.  Within the comparison procedure, the first of the two geometries will not be modified: neither the atom sequence nor the atomic coordinates will be changed, although some atoms might be deleted in order to have a match with the second geometry.  Any such deletions show up in the results only; the file that was read in is not altered.

Atoms Deleted:

First, all atoms that have the same name, residue, chain, and residue number in both geometries are retained. Of the remaining atoms, all hydrogen atoms are selected.  Within this set, if two hydrogen atoms have the same residue, chain letter, and residue number, then the atoms are retained.  All remaining atoms will be deleted, unless GEO-OK+ is also present, in which case all remaining hydrogen atoms are paired up.  This last option is only useful when a salt-bridge exists in only one of the two structures, in which case the label of the hydrogen atom involved will refer to one residue in one structure and a different residue in the other.
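The first of these rules, exact matching of atom labels, can be sketched as follows.  This is only an illustration under the assumption that each atom is represented by a (name, residue, chain, residue number) tuple; it does not reproduce the hydrogen-pairing rules or the GEO-OK+ behavior, and the function name is hypothetical.

```python
def common_atoms(first, second):
    """Split atoms into those shared by both geometries and those unique to one.

    first, second: lists of (name, residue, chain, residue_number) tuples.
    Returns (kept, only_in_first, only_in_second); order follows the input lists.
    """
    first_keys = set(first)
    second_keys = set(second)
    kept = [atom for atom in first if atom in second_keys]
    only_in_first = [atom for atom in first if atom not in second_keys]
    only_in_second = [atom for atom in second if atom not in first_keys]
    return kept, only_in_first, only_in_second
```

Keeping the order of the first list mirrors the statement that the atom sequence of the first geometry is not changed.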

The first geometry can be regarded as the reference geometry for the purposes of comparing structures.

GEO_REF="text"

The second of the two geometries is specified. Within the comparison procedure, this geometry will be modified so that the RMS difference between it and the reference geometry is minimized.  This might require the sequence of atoms to be changed and some atoms to be deleted; the atomic coordinates will certainly be changed.

Changes to Atom Sequence:

By convention, most atoms in a PDB file will be in a precisely defined position in the atom sequence, the commonest exception occurring in residues in which there is an unavoidable ambiguity in the order of the atoms.  For example, in phenylalanine CD1 and CD2 can be interchanged (swapped around) and the system would still comply with the PDB convention. Pairs of atoms of this type will be interchanged if that would reduce the RMS difference between the two structures.
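The swap test can be sketched as below.  This is a simplified illustration, not the procedure used by the program: it assumes the superposition is held fixed while the swap is evaluated, so only the two atoms in the pair affect the RMS difference, and it assumes each residue is given as a dictionary from atom name to (x, y, z).  The function name is hypothetical.

```python
import math


def swap_if_closer(ref, mov, pair):
    """Return a copy of `mov` with the two atoms in `pair` swapped if that
    lowers their summed squared distance to the reference atoms.

    ref, mov: dicts mapping atom name -> (x, y, z).
    pair: two interchangeable atom names, e.g. ("CD1", "CD2").
    """
    a, b = pair
    current = math.dist(ref[a], mov[a]) ** 2 + math.dist(ref[b], mov[b]) ** 2
    swapped = math.dist(ref[a], mov[b]) ** 2 + math.dist(ref[b], mov[a]) ** 2
    result = dict(mov)
    if swapped < current:
        # Exchange the coordinates assigned to the two atom names.
        result[a], result[b] = mov[b], mov[a]
    return result
```

Because all other atoms are untouched, comparing the summed squared distances of the pair is enough to decide whether the overall RMS difference would fall at fixed superposition.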

If two or more entire protein chains are present, and the chains are in a different sequence from that in the reference geometry, then the chains will be re-arranged to match the reference geometry.  The same will be done with hetero groups and ligands: they will be re-arranged to match their order in the reference geometry.

0SCF  and HTML

The presence of both keywords 0SCF and HTML indicates that the job is not to continue, and that after the two structures have been compared the job is to stop.

GEO-OK

This keyword is optional here.  By default, if there are any atoms in one system that are not present in the other system, no comparison will be made.  Instead, the differences will be printed and the job stopped.  If GEO-OK is present, then the systems will be trimmed so that only those atoms that are common to both systems are retained.  These will then be used in the comparison.

OUTPUT

This keyword is optional here, but is strongly recommended.  When OUTPUT is present, the large lists of atomic data are not generated; this makes reading the output file much easier.
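Putting the keywords together, a two-line data-set can be generated with a short script such as the one below.  This is a convenience sketch only: the function name and the file names used with it are hypothetical, and the keyword line simply reproduces the recommended set given above.

```python
def comparison_dataset(first, second, description, keep_common_only=True):
    """Build the text of a two-line MOPAC data-set for a structure comparison.

    first, second: names of the two geometry files (hypothetical examples).
    description: free text for the second line of the data-set.
    keep_common_only: include GEO-OK so that unmatched atoms are trimmed.
    """
    keywords = [f'GEO_DAT="{first}"', f'GEO_REF="{second}"', "0SCF", "HTML"]
    if keep_common_only:
        keywords.append("GEO-OK")
    keywords.append("OUTPUT")
    return " ".join(keywords) + "\n" + description + "\n"
```

For example, comparison_dataset("first.pdb", "second.pdb", "Comparison of two structures") returns text whose first line is the keyword line and whose second line is the description; the text can then be written to a file and run with MOPAC.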

 

Caveats

When comparing similar PDB structures that include water molecules, there is a high probability that the residue sequence numbers will be different in the two systems.  Unless the water molecules are deleted, the resulting RMS difference between the two structures will be artificially high.
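One way to delete the water molecules before running the comparison is to filter the PDB files with a small script.  The sketch below assumes standard fixed-column PDB records, in which the residue name occupies columns 18 to 20, and assumes that water is labelled HOH or WAT; other labels would need to be added to the set.  The function name is hypothetical.

```python
WATER_NAMES = {"HOH", "WAT"}  # assumption: extend if the files use other labels


def strip_water(pdb_lines):
    """Return the PDB lines with water ATOM/HETATM records removed."""
    kept = []
    for line in pdb_lines:
        is_atom_record = line.startswith(("ATOM", "HETATM"))
        # Residue name is in columns 18-20 (0-based slice 17:20).
        if is_atom_record and line[17:20].strip() in WATER_NAMES:
            continue
        kept.append(line)
    return kept
```

The filtered lines can be written to new files, which are then named in GEO_DAT and GEO_REF; the original PDB files are left untouched.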

If a substrate is present and its name is different in the two systems, then the substrate will be removed automatically.  To prevent this, use the keywords RESIDUES0 and XENO.  This will allow the atom-labels in the first geometry to be modified to suit the second geometry.  If the first geometry is not a PDB file, other keywords, specifically START_RES and CHAINS, will likely be needed in order to preserve the other atom-labels.