Skip to content

Script definition to clean alignments in .sto format #2

@pmb59

Description

@pmb59

The script to clean alignments should take as input several alignments and produce cleaned ones. The script should:

  • use uv
  • Dedup sequences - keep only unique sequences
  • Implement option to remove duplicated sequencing (even if they have distinct ID)
  • IDs cannot be duplicated (do it at the end of the fixes)
  • Check which sequences match an existing family (covert .so to Fasta and run cmscan with all models). Maybe remove the sequences from the FASTA that match a CM.
  • Use the find_start_stop logic to fix coordiinates if needed
  • If it can’t fix coordinates (previous step) Blast search the sequence and try to replace it (with top match if above certain E-value, e.g. E-value < 0.01, otherwise remove sequence). If there are many top matches, choose one from the majority of species. Or pick one randomly.
  • Check for coordinate overlap across all given alignments. If two sequences in the .so overlap, then remove randomly one of them and leave the other because this causes problems down the line. We can define a minimum overlap % (e.g. th=80) for the removal to happen
  • Produce a report/log stating all changes in an easy to read text file.

Optional

  • add a GitHub action to run the script with some simple tests
  • warn if the alignment lacks a 2D line
  • Check if the length of the 2D is shorter than the non-gap length in the RF line.
  • add syntax validation for #=GC SS_cons

Notes:

  • duplicates are defined as same seq and sam ID
  • Use easel toolkit, whenever possible script. This tool is used in Rfam, so output should be mostly compatible with Rfam processes.
  • https://scikit.bio/docs/latest/generated/skbio.io.format.stockholm.html , biopython has some STOCK support too
  • see esl-alistat or esl-reformat
  • Reverse complement sequences are OK (even with coordinates of the form 299-99): we should NOT change the sequence, but we can still remove it if it has significant overlap with another

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions