Remove undesirable atoms #62
base: develop
After consultation with the PIs, specific residues were identified as tests for the occupancy selection code.
Found a syntax error, some ugly whitespace, and decimal/float confusion.
Previous version did not remove undesirable atoms. BioPython has the Select class for this kind of task. Rewrote the code to pass the tests.
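For readers unfamiliar with it, here is a minimal sketch of how a Select subclass can drop atoms, assuming Biopython's `Bio.PDB.Select` hook `accept_atom` (a true return value keeps the atom). The class name `KeepAltlocSelect` and its `keep` parameter are hypothetical and are not the code in this PR; the stand-in base class exists only so the sketch runs where Biopython is not installed.

```python
try:
    from Bio.PDB import Select  # real base class when Biopython is installed
except ImportError:
    class Select(object):
        """Stand-in with the same hook so the sketch runs without Biopython."""

        def accept_atom(self, atom):
            return 1


class KeepAltlocSelect(Select):
    """Hypothetical selector: keep ordered atoms plus one chosen altloc."""

    def __init__(self, keep):
        self.keep = keep  # altloc identifier to retain, e.g. 'A'

    def accept_atom(self, atom):
        # Ordered atoms carry a blank altloc and are always kept.
        return atom.get_altloc() in (' ', self.keep)
```

With Biopython, a selector like this would typically be handed to the writer, e.g. `PDBIO().save(path, select=KeepAltlocSelect('A'))`, so rejected atoms never reach the output file.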
Rebuilt structure after occupancy selection, removed unnecessary altloc selection, and added ability to delete residues by old_id. All of this necessitated rebuilding the fixture.
This approach allows for additional proteins to be tested.
pgd_splicer/ProcessPDBTask.py
"""
This class implements the occupancy awareness selection.

Each disordered residue contains one or more disordered atom structures. These structures contain one or more atoms, each with its own altloc and occupancy values. As each atom associated with a particular altloc has its own occupancy value, the altloc with the highest average occupancy value is identified. Atoms with other altloc values are removed from the structure.
pep8 violation, 80 characters per line.
This file has 173 PEP8 violations, most of which are more interesting than line-length.
I will wrap the comments. :-)
Heh, I had no idea. The actual source of my gripe was that GH was cutting off the line so I couldn't read the whole thing. Thanks for changing it though :-)
Comments were too long, so I wrapped them. Commands that were too long lose coherency so I left them. Documented the vars() trick for future me and others.
1EJG was observed to not work with the new code, so here it is. The test as committed fails, which is to be expected. Now I have to fix it!
Because the occupancy requirements cross residue boundaries, the check is done at the chain level. Shims have been added to resolve invalid residue issues at an earlier state, which may improve performance.
Not all disordered atoms were being examined. This has been corrected. The res_occ variable was being overwritten, causing loss of information. Some debugging that was helpful was left in but commented. In the future, perhaps a logger should be passed to the class.
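The overwrite problem can be illustrated with a small sketch of the rule in the class docstring (pick the altloc with the highest average occupancy). This is not the PR's code: the function name `best_altloc` and the flat list of `(altloc, occupancy)` pairs are assumptions made for illustration. The point is that occupancies are appended per altloc rather than assigned, so no earlier atom is lost.

```python
from collections import defaultdict


def best_altloc(atoms):
    """Return the altloc with the highest mean occupancy, or None.

    `atoms` is an iterable of (altloc, occupancy) pairs; this flat input
    shape is an assumption made for the sketch.
    """
    occupancies = defaultdict(list)
    for altloc, occupancy in atoms:
        # Append rather than assign, so earlier atoms are not overwritten.
        occupancies[altloc].append(occupancy)
    if not occupancies:
        return None

    def mean_occupancy(altloc):
        values = occupancies[altloc]
        return sum(values) / float(len(values))

    return max(occupancies, key=mean_occupancy)


atoms = [('A', 0.6), ('B', 0.4), ('A', 0.5), ('B', 0.5)]
print(best_altloc(atoms))
```

How ties between altlocs should be broken is not stated in this thread; the sketch simply returns whichever tied altloc was seen first.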
Debugging this is hard without logging.
Relevant changes:
* ProcessPDB.log is now optional.
* Input only accepted via stdin at the moment.
The logging feature is now enabled from the command line. The hetflags and not-in-AA3to1 detection and removal are now enabled. Finally, the altlocs of all kept atoms are set to ' '.
The code that checked for missing atoms was rewritten to catch cases where the best altloc choice was incomplete. Fewer missing-atom residues will result.
Sometimes no altloc is good. Let's count those and report back.
Forgot to disable the writing of postocc PDB files.
BioPython made a subtle change to their DSSP code, so we have to change as well. Also, the B factor issue reported by Dale should be fixed.
B factors are real, so they should use real division, not floor division. See code.o.o #18261
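A short sketch of the difference, assuming the inputs reach the average as integers (under Python 2, `/` between two integers floors just like `//`). The function name `mean_b_factor` is hypothetical; the explicit `float()` keeps the result real on either Python version.

```python
def mean_b_factor(b_factors):
    """Average B factor using real division on Python 2 and 3 alike."""
    return sum(b_factors) / float(len(b_factors))


values = [10, 11]  # integer inputs expose the difference
floor_mean = sum(values) // len(values)  # floor division drops the fraction
real_mean = mean_b_factor(values)        # real division keeps it
```

Here `floor_mean` is 10 while `real_mean` is 10.5, which is the kind of silent truncation the commit message describes.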
The same issue impacting B factor with respect to hydrogens is also affecting occupancy. This will fix that issue.
Merge mathuin/pgd:Remove-undesirable-atoms into GiriB/pgd:Remove-undesirable-atoms
Adds a FileHandler and StreamHandler for separate logging to File and Console respectively.
Separate logging to File and Console
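A minimal sketch of this arrangement using the standard `logging` module. The helper name `build_logger`, the level choices, and the format string are assumptions for illustration, not the code merged here.

```python
import logging


def build_logger(name, logfile, console_level=logging.WARNING):
    """Hypothetical helper: log everything to a file, less to the console."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)

    file_handler = logging.FileHandler(logfile)
    file_handler.setLevel(logging.DEBUG)

    console_handler = logging.StreamHandler()  # defaults to sys.stderr
    console_handler.setLevel(console_level)

    formatter = logging.Formatter('%(asctime)s %(levelname)s %(message)s')
    for handler in (file_handler, console_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

A call such as `log = build_logger('ProcessPDB', 'ProcessPDB.log')` would then send debug detail to the file only and warnings to both. Calling the helper twice with the same name would attach duplicate handlers, so it is meant to be called once per logger.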
The PGDSelect class is used for selecting the most appropriate atoms based on occupancy. It has been modified to ignore hydrogen atoms when performing this calculation and to not accept those atoms when pre-processing the residues. This should remove hydrogen atoms from the database.
The PGDSelect class was getting big and hard to maintain inside the main script so I put it in its own file. The AA3to1 check was disabled as it no longer worked without easy access to that variable.
Turns out sometimes the only atoms that are disordered are hydrogens and since we're removing those, there's no reason to consider those residues disordered.
Fixes another side effect of stripping hydrogen atoms from residues: residues that are no longer disordered once the hydrogens are gone are now handled. Also improved debugging for protein, chain and residue models.
Some residues have icodes -- 199A instead of just 199. The old code did not differentiate between residues based on icodes. Now it does!
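To make the icode point concrete, here is a small sketch that keys residues on both the sequence number and the insertion code. The helper `parse_residue_label` and the label strings are hypothetical; the blank icode for plain residues follows the PDB convention of a space in that column.

```python
import re


def parse_residue_label(label):
    """Split a label such as '199A' into (resseq, icode).

    A label without an insertion code gets a blank icode.
    """
    match = re.match(r'^(-?\d+)([A-Za-z]?)$', label.strip())
    if match is None:
        raise ValueError('unrecognised residue label: %r' % label)
    return int(match.group(1)), match.group(2) or ' '


residues = {}
for label in ['199', '199A', '200']:
    residues[parse_residue_label(label)] = label
# Keying on resseq alone would collapse '199' and '199A' into one entry.
```

With the `(resseq, icode)` key the dictionary holds three distinct residues, whereas a key of `resseq` alone would hold two.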
BioPython has translations between three-letter residue types and single-letter codes. So does PGD. I removed PGD's version and used BioPython's version instead. Fewer lines of code, same functionality.
Now that we're stripping hydrogen atoms in the select, we can stop looking for them in parseWithBioPython.
The hydrogen check was changed to look for hydrogen or deuterium. It was cleaner to add it as its own method at this point. Previously we prefiltered ATOM lines that had no valid amino acids and HETATM lines with valid amino acids. This prefiltering is no longer necessary and has a negative impact on processing efficiency.
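A sketch of what a standalone hydrogen-or-deuterium check might look like. The function name `is_hydrogen` and the atom-name fallback are assumptions for illustration; the fallback in particular is a guess that can misfire on heteroatoms whose names begin with H or D, so it is only used when no element symbol is available.

```python
def is_hydrogen(element, name=''):
    """Return True for hydrogen or deuterium atoms.

    `element` is the PDB element symbol; `name` is the atom name, used
    only as a fallback when the element is blank (an assumption of this
    sketch, not a rule taken from the PR).
    """
    element = element.strip().upper()
    if element:
        return element in ('H', 'D')
    # Names like '1HB' or 'HD1': skip leading digits, then test the letter.
    stripped = name.strip().lstrip('0123456789').upper()
    return stripped[:1] in ('H', 'D')
```

Keeping the check in one place means the selector and any pre-processing step agree on what counts as a hydrogen.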
The previous version of the occupancy select code was incomplete and generated incorrect results.
The code was refactored and now generates more accurate results.