Conversation

Collaborator

@mathuin mathuin commented Jul 29, 2016

The previous version of the occupancy select code was incomplete and generated incorrect results.

The code was refactored and now generates more accurate results.

mathuin added 5 commits July 25, 2016 16:45
After consultation with the PIs, specific residues were identified as tests for the occupancy selection code.
Found a syntax error, some ugly whitespace, and decimal/float confusion.
Previous version did not remove undesirable atoms.

BioPython has the Select class for this kind of task.

Rewrote the code to pass the tests.
Rebuilt structure after occupancy selection, removed unnecessary altloc selection, and added ability to delete residues by old_id.  All of this necessitated rebuilding the fixture.
This approach allows for additional proteins to be tested.
"""
This class implements the occupancy awareness selection.

Each disordered residue contains one or more disordered atom structures. These structures contain one or more atoms, each with its own altloc and occupancy values. As each atom associated with a particular altloc has its own occupancy value, the altloc with the highest average occupancy value is identified. Atoms with other altloc values are removed from the structure.
Contributor

pep8 violation, 80 characters per line.

Collaborator Author

This file has 173 PEP8 violations, most of which are more interesting than line-length.

I will wrap the comments. :-)

Contributor

Heh, I had no idea. The actual source of my gripe was that GH was cutting off the line so I couldn't read the whole thing. Thanks for changing it though :-)
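
For readers following the thread, the selection rule described in the docstring above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: atoms are modelled as hypothetical `(name, altloc, occupancy)` tuples rather than BioPython objects, atoms with a blank altloc are treated as ordered and always kept, and ties are broken by altloc sort order.

```python
from collections import defaultdict


def best_altloc(atoms):
    """Return the altloc with the highest mean occupancy, or None.

    `atoms` is a list of (name, altloc, occupancy) tuples. This tuple
    layout is an assumption made for the sketch.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _name, altloc, occupancy in atoms:
        if altloc.strip():  # blank altloc means the atom is not disordered
            totals[altloc] += occupancy
            counts[altloc] += 1
    if not totals:
        return None
    # max() keeps the first maximal item, so ties go to the lowest altloc.
    return max(sorted(totals), key=lambda a: totals[a] / counts[a])


def select_atoms(atoms):
    """Keep ordered atoms plus atoms of the best altloc; blank the altloc."""
    keep = best_altloc(atoms)
    return [
        (name, " ", occupancy)
        for name, altloc, occupancy in atoms
        if not altloc.strip() or altloc == keep
    ]
```

Averaging over the whole altloc, rather than picking the best atom one at a time, keeps the surviving atoms mutually consistent.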

mathuin and others added 20 commits July 29, 2016 16:23
Comments were too long, so I wrapped them.

Long commands lose coherency when wrapped, so I left them as they were.

Documented the vars() trick for future me and others.
1EJG was observed to not work with the new code, so here it is.

The test as committed fails, which is to be expected.  Now I have to fix it!
Because the occupancy requirements cross residue boundaries, the check
is done at the chain level.  Shims have been added to resolve invalid
residue issues at an earlier stage, which may improve performance.
Not all disordered atoms were being examined.  This has been corrected.

The res_occ variable was being overwritten, causing loss of information.

Some helpful debugging output was left in but commented out.  In the future, perhaps a logger should be passed to the class.
Debugging this is hard without logging.
Relevant changes:
 * ProcessPDB.log is now optional.
 * Input only accepted via stdin at the moment.
The logging feature is now enabled from the command line.

The hetflags and not-in-AA3to1 detection and removal are now enabled.

Finally, the altlocs of all kept atoms are set to ' '.
The code that checked for missing atoms was rewritten to catch cases where the best altloc choice was incomplete.  Fewer missing-atom residues will result.
Sometimes no altloc is good.  Let's count those and report back.
Forgot to disable the writing of postocc PDB files.
BioPython made a subtle change to their DSSP code, so we have to change
as well.  Also, the B factor issue reported by Dale should be fixed.
B factors are real, so they should use real division, not floor division.

See code.o.o #18261
The same issue impacting B factor with respect to hydrogens is also
affecting occupancy.  This will fix that issue.
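
The difference between the two kinds of division is easy to see in a small example. The sketch below is not the code from this PR; the function names are hypothetical and the values are made up. It only shows why averaging with floor division silently drops the fractional part. In Python 2, `/` applied to two integers also floors, which is one way this kind of bug can creep in.

```python
def mean_true(values):
    """Average using true division; keeps the fractional part."""
    return sum(values) / float(len(values))


def mean_floor(values):
    """Average using floor division; shown only to illustrate the bug."""
    return sum(values) // len(values)


b_factors = [10, 11]  # illustrative values, not real data
print(mean_true(b_factors), mean_floor(b_factors))
```

The same reasoning applies to occupancy, which is also a real-valued quantity.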
Merge mathuin/pgd:Remove-undesirable-atoms into GiriB/pgd:Remove-undesirable-atoms
Adds a FileHandler and StreamHandler for separate logging
to File and Console respectively.
Separate logging to File and Console
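
A minimal sketch of this arrangement with the standard `logging` module follows. The function name, logger name, levels, and formats are assumptions chosen for illustration; only `FileHandler` and `StreamHandler` come from the commit message.

```python
import logging
import sys


def build_logger(name, logfile):
    """Return a logger that writes to both a file and the console.

    Hypothetical helper: the file receives everything from DEBUG up,
    while the console only shows INFO and above.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)  # let the handlers do the filtering

    file_handler = logging.FileHandler(logfile)
    file_handler.setLevel(logging.DEBUG)
    file_handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

    console_handler = logging.StreamHandler(sys.stderr)
    console_handler.setLevel(logging.INFO)
    console_handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

    logger.addHandler(file_handler)
    logger.addHandler(console_handler)
    return logger
```

Giving each handler its own level is what makes the two destinations independent: verbose detail can go to the file without flooding the console.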
The PGDSelect class is used for selecting the most appropriate atoms
based on occupancy.  It has been modified to ignore hydrogen atoms when
performing this calculation and to not accept those atoms when
pre-processing the residues.  This should remove hydrogen atoms from
the database.
The PGDSelect class was getting big and hard to maintain inside the
main script so I put it in its own file.  The AA3to1 check was disabled
as it no longer worked without easy access to that variable.
Turns out sometimes the only atoms that are disordered are hydrogens
and since we're removing those, there's no reason to consider those
residues disordered.
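
The rule in this commit can be illustrated with a small sketch. It is not the PR's implementation: atoms are modelled as hypothetical `(name, element, altloc)` tuples, and the set of element symbols treated as hydrogen is an assumption.

```python
HYDROGEN_ELEMENTS = frozenset(["H", "D"])  # assumed set for this sketch


def heavy_atom_altlocs(atoms):
    """Collect the non-blank altlocs found on non-hydrogen atoms."""
    return {
        altloc
        for _name, element, altloc in atoms
        if element.strip().upper() not in HYDROGEN_ELEMENTS and altloc.strip()
    }


def is_disordered(atoms):
    """A residue counts as disordered only if a heavy atom has an altloc."""
    return bool(heavy_atom_altlocs(atoms))
```

A residue whose only alternate locations sit on hydrogens therefore falls through to the ordinary, non-disordered path.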
This fixes another side effect of stripping hydrogens from residues:
residues that are no longer disordered afterwards are now handled.

Also improved debugging for protein, chain and residue models.
Some residues have icodes -- 199A instead of just 199.

The old code did not differentiate between residues based on icodes.

Now it does!
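
To make the distinction concrete, here is a small sketch that splits a residue label such as `199A` into a sequence number and an insertion code. The helper name and the label format are assumptions for illustration. BioPython itself identifies a residue by a tuple of hetero flag, sequence number, and insertion code, so 199 and 199A already differ there once the icode is taken into account.

```python
import re

# Optional minus sign, digits, then at most one letter as the icode.
_RESIDUE_LABEL = re.compile(r"^(-?\d+)([A-Za-z]?)$")


def split_residue_label(label):
    """Split '199A' into (199, 'A'); a missing icode becomes ' '."""
    match = _RESIDUE_LABEL.match(label.strip())
    if match is None:
        raise ValueError("unrecognised residue label: %r" % label)
    number, icode = match.groups()
    return int(number), icode or " "
```

Using the `(number, icode)` pair as a lookup key keeps 199 and 199A as separate entries instead of letting one overwrite the other.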
BioPython has translations between three-letter residue types
and single-letter codes.  So does PGD.  I removed PGD's version
and used BioPython's version instead.  Fewer lines of code, same
functionality.
Now that we're stripping hydrogen atoms in the select, we can
stop looking for them in parseWithBioPython.
The hydrogen check was changed to look for hydrogen or deuterium.  It was cleaner
to add it as its own method at this point.
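
A check of this kind might look like the sketch below. It is an assumption-laden illustration, not the method added in this commit: it relies only on the element symbol and models atoms as hypothetical `(name, element)` tuples.

```python
HYDROGEN_ELEMENTS = frozenset(["H", "D"])  # hydrogen and deuterium


def is_hydrogen(element):
    """Return True if the element symbol is hydrogen or deuterium."""
    return element.strip().upper() in HYDROGEN_ELEMENTS


def strip_hydrogens(atoms):
    """Drop hydrogen and deuterium atoms from (name, element) tuples."""
    return [atom for atom in atoms if not is_hydrogen(atom[1])]
```

Testing the element symbol rather than the atom name avoids confusing a mercury atom (element `HG`) with a gamma hydrogen that is merely named `HG`.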

Previously we prefiltered ATOM lines that had no valid amino acids and HETATM lines
with valid amino acids.  This prefiltering is no longer necessary and has a negative
impact on processing efficiency.
3 participants