-
Notifications
You must be signed in to change notification settings - Fork 0
Open
Description
The script to clean alignments should take as input several alignments and produce cleaned ones. The script should:
- use uv
- Dedup sequences - keep only unique sequences
- Implement option to remove duplicated sequencing (even if they have distinct ID)
- IDs cannot be duplicated (do it at the end of the fixes)
- Check which sequences match an existing family (covert
.sotoFastaand run cmscan with all models). Maybe remove the sequences from the FASTA that match a CM. - Use the find_start_stop logic to fix coordiinates if needed
- If it can’t fix coordinates (previous step) Blast search the sequence and try to replace it (with top match if above certain E-value, e.g. E-value < 0.01, otherwise remove sequence). If there are many top matches, choose one from the majority of species. Or pick one randomly.
- Check for coordinate overlap across all given alignments. If two sequences in the
.sooverlap, then remove randomly one of them and leave the other because this causes problems down the line. We can define a minimum overlap % (e.g. th=80) for the removal to happen - Produce a report/log stating all changes in an easy to read text file.
Optional
- add a GitHub action to run the script with some simple tests
- warn if the alignment lacks a 2D line
- Check if the length of the 2D is shorter than the non-gap length in the RF line.
- add syntax validation for
#=GC SS_cons
Notes:
- duplicates are defined as same seq and sam ID
- Use easel toolkit, whenever possible script. This tool is used in Rfam, so output should be mostly compatible with Rfam processes.
- https://scikit.bio/docs/latest/generated/skbio.io.format.stockholm.html , biopython has some STOCK support too
- see esl-alistat or esl-reformat
- Reverse complement sequences are OK (even with coordinates of the form 299-99): we should NOT change the sequence, but we can still remove it if it has significant overlap with another
Metadata
Metadata
Assignees
Labels
No labels