Remove undesirable atoms #62

mathuin · 2016-07-29T21:26:24Z

The previous version of the occupancy select code was incomplete and generated incorrect results.

The code was refactored and now generates more accurate results.

After consultation with the PIs, specific residues were identified as tests for the occupancy selection code.

Found a syntax error, some ugly whitespace, and decimal/float confusion.

Previous version did not remove undesirable atoms. BioPython has the Select class for this kind of task. Rewrote the code to pass the tests.

Rebuilt structure after occupancy selection, removed unnecessary altloc selection, and added ability to delete residues by old_id. All of this necessitated rebuilding the fixture.

This approach allows for additional proteins to be tested.
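
The description above names BioPython's `Select` class as the tool for dropping unwanted atoms. The following is a minimal sketch of that mechanism, not the `PGDSelect` class from this PR: the class name `KeepAltlocSelect` and its `keep` parameter are hypothetical, and the `ImportError` fallback is there only so the snippet runs where BioPython is not installed.

```python
try:
    from Bio.PDB import Select  # base class consulted by PDBIO.save()
except ImportError:
    # Stand-in so the sketch still runs without BioPython (assumption).
    class Select(object):
        def accept_atom(self, atom):
            return True


class KeepAltlocSelect(Select):
    """Hypothetical filter: keep ordered atoms plus one chosen altloc."""

    def __init__(self, keep="A"):
        self.keep = keep

    def accept_atom(self, atom):
        # Ordered atoms carry a blank altloc; disordered ones carry a letter.
        return atom.get_altloc() in (" ", self.keep)
```

With BioPython available, an instance of such a class would be handed to `PDBIO.save` through its `select` argument, and atoms for which `accept_atom` returns a false value are left out of the written file.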

pop · 2016-07-29T21:34:13Z

pgd_splicer/ProcessPDBTask.py

+    """
+    This class implements the occupancy awareness selection.
+
+    Each disordered residue contains one or more disordered atom structures.  These structures contain one or more atoms, each with its own altloc and occupancy values.  As each atom associated with a particular altloc has its own occupancy value, the altloc with the highest average occupancy value is identified.  Atoms with other altloc values are removed from the structure.


pep8 violation, 80 characters per line.

mathuin · 2016-07-29

This file has 173 PEP8 violations, most of which are more interesting than line-length.

I will wrap the comments. :-)

pop · 2016-07-29

Heh, I had no idea. The actual source of my gripe was that GH was cutting off the line so I couldn't read the whole thing. Thanks for changing it though :-)
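
The docstring quoted in this review thread describes picking the altloc with the highest average occupancy and discarding atoms from the other altlocs. A rough sketch of that rule for a single residue could look like the code below. It assumes a simplified, hypothetical layout of `(name, altloc, occupancy)` tuples rather than BioPython objects, and it does not reflect the chain-level handling that later commits in this PR introduce.

```python
from collections import defaultdict


def best_altloc(atoms):
    """Return the altloc with the highest mean occupancy.

    `atoms` is an iterable of (name, altloc, occupancy) tuples for the
    disordered atoms of one residue (a simplified, assumed layout).
    """
    by_altloc = defaultdict(list)
    for _name, altloc, occupancy in atoms:
        by_altloc[altloc].append(occupancy)
    # float() keeps the division real even under Python 2 semantics.
    means = {alt: sum(occ) / float(len(occ)) for alt, occ in by_altloc.items()}
    return max(means, key=means.get)


def select_atoms(atoms):
    """Drop every atom whose altloc is not the best one."""
    atoms = list(atoms)
    keep = best_altloc(atoms)
    return [atom for atom in atoms if atom[1] == keep]
```

For a residue with altloc A at occupancies 0.6 and 0.7 and altloc B at 0.4 and 0.3, `best_altloc` picks A and `select_atoms` returns only the two A atoms. Ties are broken arbitrarily in this sketch; the thread does not say how the PR handles them.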

Not all disordered atoms were being examined. This has been corrected. The res_occ variable was being overwritten, causing loss of information. Some debugging that was helpful was left in but commented. In the future, perhaps a logger should be passed to the class.

The PGDSelect class is used for selecting the most appropriate atoms based on occupancy. It has been modified to ignore hydrogen atoms when performing this calculation and to not accept those atoms when pre-processing the residues. This should remove hydrogen atoms from the database.

The hydrogen check was changed to look for hydrogen or deuterium. It was cleaner to add it as its own method at this point. Previously we prefiltered ATOM lines that had no valid amino acids and HETATM lines with valid amino acids. This prefiltering is no longer necessary and has a negative impact on processing efficiency.
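
A hydrogen-or-deuterium check of the kind described above might be sketched as follows. The function names are hypothetical and the atoms are plain `(name, element)` pairs rather than BioPython atoms, so this shows only the idea, not the method added in the PR.

```python
def is_hydrogen(element):
    """True for hydrogen or deuterium (element symbols H and D)."""
    return element.strip().upper() in ("H", "D")


def strip_hydrogens(atoms):
    """Return only the heavy atoms from (name, element) pairs."""
    return [(name, element) for name, element in atoms
            if not is_hydrogen(element)]
```

Keeping the test in one small function means the same rule can be applied wherever atoms are filtered, instead of repeating the element comparison.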

mathuin added 5 commits July 25, 2016 16:45

Changed tests to reflect actual results.

dcc6290

After consultation with the PIs, specific residues were identified as tests for the occupancy selection code.

Minor tweaks to tests.

9130e24

Found a syntax error, some ugly whitespace, and decimal/float confusion.

Refactored occupancy select code.

3a2d3c8

Previous version did not remove undesirable atoms. BioPython has the Select class for this kind of task. Rewrote the code to pass the tests.

Minor fixes, rebuilt fixtures.

45f7284

Rebuilt structure after occupancy selection, removed unnecessary altloc selection, and added ability to delete residues by old_id. All of this necessitated rebuilding the fixture.

Cleaned up tests.

d0c2479

This approach allows for additional proteins to be tested.

pop reviewed Jul 29, 2016

mathuin and others added 20 commits July 29, 2016 16:23

Minor modifications for line length and clarity.

835d765

Comments were too long, so I wrapped them. Commands that were too long lose coherency so I left them. Documented the vars() trick for future me and others.

Added new structure for tests.

1c7639a

1EJG was observed to not work with the new code, so here it is. The test as committed fails, which is to be expected. Now I have to fix it!

Fixed occupancy code again.

34389f2

Because the occupancy requirements cross residue boundaries, the check is done at the chain level. Shims have been added to resolve invalid residue issues at an earlier state, which may improve performance.

Added check for structure parsing.

1562602

Added logging support.

f397b7b

Debugging this is hard without logging.

Added argparse support for logging.

744e3b5

Relevant changes: * ProcessPDB.log is now optional. * Input only accepted via stdin at the moment.
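
An optional log file plus stdin-only input could be wired up with `argparse` roughly like this. The `--log` flag name and both function names are assumptions made for illustration; the commit message does not spell out the actual command-line interface.

```python
import argparse
import sys


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Process input read from stdin.")
    # Hypothetical flag: when it is omitted, no log file is written.
    parser.add_argument("--log", default=None, metavar="FILE",
                        help="optional path for the log file")
    return parser.parse_args(argv)


def read_lines(stream=sys.stdin):
    """Yield the non-empty, stripped lines of the input stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield line
```

Calling `parse_args([])` leaves `log` as `None`, which the caller can treat as "logging to file disabled".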

Enabled logging and pre-cleaning.

26c1b14

The logging feature is now enabled from the command line. The hetflags and not-in-AA3to1 detection and removal is now enabled. Finally, the altlocs of all kept atoms are set to ' '.

Fixed missing atom check.

a284348

The code that checked for missing atoms was rewritten to catch cases where the best altloc choice was incomplete. Fewer missing-atom residues will result.

Properly trapped no-acceptable-altloc case.

23180f5

Sometimes no altloc is good. Let's count those and report back.

Disable debugging code.

a184dca

Forgot to disable the writing of postocc PDB files.

Fix for B factors, plus BioPython update.

d42ae8f

BioPython made a subtle change to their DSSP code, so we have to change as well. Also, the B factor issue reported by Dale should be fixed.

Fixed division for B factors.

a245ffb

B factors are real, so they should use real division, not floor division. See code.o.o #18261
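
The difference between real and floor division is easy to see on a small average. This sketch does not reproduce the code fixed in the commit, which is not shown here; the function names are hypothetical and only the two operators are being contrasted.

```python
def mean_b_factor(b_factors):
    """Average using real division; float() also covers Python 2 integers."""
    return sum(b_factors) / float(len(b_factors))


def floored_mean(b_factors):
    """Buggy variant: floor division throws away the fractional part."""
    return sum(b_factors) // len(b_factors)
```

For the values 10 and 11, the real-division mean is 10.5 while the floored variant yields 10, so every averaged B factor would be biased downward.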

Fixed occupancy like B factor.

d5d511a

The same issue impacting B factor with respect to hydrogens is also affecting occupancy. This will fix that issue.

Merge pull request #1 from mathuin/remove-undesirable-atoms

1382d31

Merge mathuin/pgd:Remove-undesirable-atoms into GiriB/pgd:Remove-undesirable-atoms

Fix logging issue

6eb80dc

Adds a FileHandler and StreamHandler for separate logging to File and Console respectively.
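
Separate file and console logging with the standard `logging` module usually comes down to attaching two handlers to one logger. The sketch below assumes a hypothetical `build_logger` helper and arbitrary level choices; the commit's actual configuration is not shown in this thread.

```python
import logging


def build_logger(name, logfile=None):
    """Return a logger that writes to the console and, optionally, a file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")

    console = logging.StreamHandler()  # writes to sys.stderr by default
    console.setLevel(logging.INFO)     # assumed: console shows INFO and up
    console.setFormatter(formatter)
    logger.addHandler(console)

    if logfile is not None:
        to_file = logging.FileHandler(logfile)
        to_file.setLevel(logging.DEBUG)  # assumed: file keeps everything
        to_file.setFormatter(formatter)
        logger.addHandler(to_file)
    return logger
```

Because each handler has its own level, debug output can go to the file without cluttering the console. Calling the helper twice with the same name would attach duplicate handlers, so it is meant to be called once per logger.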

Merge pull request #2 from GiriB/remove-undesirable-atoms

617d1fb

Separate logging to File and Console

Moved PGDSelect into its own file.

a5a1fa3

The PGDSelect class was getting big and hard to maintain inside the main script so I put it in its own file. The AA3to1 check was disabled as it no longer worked without easy access to that variable.

Handled case where H is only disordered atom.

bac8c0d

Turns out sometimes the only atoms that are disordered are hydrogens and since we're removing those, there's no reason to consider those residues disordered.

mathuin added 7 commits October 24, 2016 15:55

continue not break doh

4ca6e1c

Cleaner handling of ordered residues.

fff0147

Another side-effect of stripping hydrogen from residues and handling those residues for which they are no longer disordered. Also improved debugging for protein, chain and residue models.

Added support for icodes in select code.

bfb6ab1

Some residues have icodes -- 199A instead of just 199. The old code did not differentiate between residues based on icodes. Now it does!
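
BioPython residue ids are `(hetflag, resseq, icode)` tuples, so telling 199 apart from 199A means keeping the insertion code in whatever key is used. A small sketch with hypothetical helper names, working on plain tuples rather than the PR's select code:

```python
def residue_key(residue_id):
    """Key a residue by sequence number and insertion code."""
    _hetflag, resseq, icode = residue_id
    return (resseq, icode)


def format_residue(residue_id):
    """Render a residue id as text, e.g. 199 or 199A."""
    _hetflag, resseq, icode = residue_id
    return "%d%s" % (resseq, icode.strip())
```

Keying on `resseq` alone would make the two residues collide; including `icode` keeps them distinct.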

Refactored AA_CHOICES to use BioPython values.

da1ad95

BioPython has translations between three-letter residue types and single-letter codes. So does PGD. I removed PGD's version and used BioPython's version instead. Fewer lines of code, same functionality.

Removed hydrogen checks.

9434730

Now that we're stripping hydrogen atoms in the select, we can stop looking for them in parseWithBioPython.

Removed MAINTAINER line.

6c32999
