Rijanhastwoears/DSEintro

User info:

Name: Rijan

Fun fact: I have been chased by a king cobra.

Pet: I had a parakeet named Patty a long, long time ago. It learned how to curse by being around my grandfather.

Favourite classes before joining UTK:

  • Research credit hours with my Master's PI
  • Plant Genomics
  • Quantum Chemistry
  • Biochemistry
  • Technical Writing

cds_to_peptide.py

The main purpose of this script is to take orthogroup files, which contain lists of related peptide (protein) sequences from different species, and create corresponding "mirror" files containing the CDS (coding DNA sequences) for those same proteins.

In simpler terms, it translates lists of protein identifiers into lists of the DNA sequences that code for them. Paired files like these (peptide orthogroups and their matching CDS orthogroups) are needed for various workflows in evolutionary analysis.

Here is a step-by-step breakdown of its workflow:

  • Argument Parsing: The script starts by defining and parsing command-line arguments (--orthos, --cds, --pickle, --output) so the user can specify the locations of the input and output files. It also validates that the provided paths exist.
  • Loading Data into Memory: It loads a "species code" dictionary from a pickle file. This dictionary maps short codes (e.g., Hsap) found in sequence headers to a full species name (e.g., homo_sapiens). It then loads all the CDS sequence data from a directory of pickle files. Each pickle file contains a dictionary mapping sequence headers to their corresponding DNA sequences for a single species. All of this data is loaded into memory at once, which avoids slow, repeated disk access later.
  • Parallel Processing: The script finds all the orthogroup files in the specified input directory (--orthos) and uses the multiprocessing library to process the files in parallel, spreading the work across the available CPU cores.
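The first three steps can be sketched as follows. This is a minimal illustration rather than the script's actual code: the function names, the `*.pkl` naming of the per-species files, and the placeholder worker `count_headers` are all assumptions made for the example.

```python
import argparse
import pickle
from multiprocessing import Pool
from pathlib import Path


def build_parser():
    """Declare the four options named in the breakdown above."""
    parser = argparse.ArgumentParser(
        description="Mirror peptide orthogroup files as CDS files."
    )
    for flag in ("--orthos", "--cds", "--pickle", "--output"):
        parser.add_argument(flag, type=Path, required=True)
    return parser


def validate_paths(args):
    """Fail early if an input path is missing (the output path may not exist yet)."""
    for path in (args.orthos, args.cds, args.pickle):
        if not path.exists():
            raise FileNotFoundError(path)


def load_pickle(path):
    with open(path, "rb") as handle:
        return pickle.load(handle)


def load_all_cds(cds_dir):
    """Read every per-species pickle once: {species name: {header: sequence}}.

    Assumption: each file is named <species>.pkl.
    """
    return {p.stem: load_pickle(p) for p in sorted(Path(cds_dir).glob("*.pkl"))}


def count_headers(ortho_path):
    """Placeholder worker: count the '>' header lines in one orthogroup file."""
    with open(ortho_path) as handle:
        n_headers = sum(line.startswith(">") for line in handle)
    return Path(ortho_path).name, n_headers


def main(argv=None):
    args = build_parser().parse_args(argv)
    validate_paths(args)
    species_codes = load_pickle(args.pickle)
    cds_by_species = load_all_cds(args.cds)
    print(f"Loaded {len(species_codes)} species codes and "
          f"CDS data for {len(cds_by_species)} species")
    ortho_files = sorted(p for p in args.orthos.iterdir() if p.is_file())
    # One task per orthogroup file; Pool() defaults to one worker per CPU core.
    with Pool() as pool:
        return dict(pool.map(count_headers, ortho_files))
```

To try it, call `main()` from under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that spawn worker processes. A real worker would do the CDS lookup instead of only counting headers.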

Sequence Translation (for each file):

For a given orthogroup file, the script reads all the peptide sequence headers (lines starting with >). For each header, it identifies the species using the "species code" dictionary, then looks up the matching CDS sequence in that species' in-memory data, loaded during the Loading Data into Memory step.
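A rough sketch of that lookup is below. The header layout is an assumption: the example treats the species code as the text between ">" and the first "|", and assumes the CDS dictionaries are keyed by the full header line. The function names are hypothetical too.

```python
def read_headers(lines):
    """Keep only the FASTA header lines (those starting with '>')."""
    return [line.strip() for line in lines if line.startswith(">")]


def species_code(header):
    """Assumption: the code sits between '>' and the first '|', e.g. 'Hsap'."""
    return header.lstrip(">").split("|", 1)[0]


def lookup_cds(header, species_codes, cds_by_species):
    """Return the CDS for one peptide header, or None if nothing matches."""
    species = species_codes.get(species_code(header))
    if species is None:
        return None
    return cds_by_species.get(species, {}).get(header)


# Toy data standing in for the pickled dictionaries.
species_codes = {"Hsap": "homo_sapiens"}
cds_by_species = {"homo_sapiens": {">Hsap|gene1": "ATGGCC"}}

orthogroup_lines = [">Hsap|gene1\n", "MA\n", ">Xxxx|gene9\n", "MK\n"]
headers = read_headers(orthogroup_lines)
found = {h: lookup_cds(h, species_codes, cds_by_species) for h in headers}
```

Returning `None` for an unknown code or header is a design choice for this sketch; the script may handle missing entries differently.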

The original header and its newly found CDS sequence are stored together in a structured NumPy array, which keeps each header paired with its sequence in a single record.
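For readers unfamiliar with structured arrays, here is a small illustration. The field names and dtypes are assumptions, not taken from the script: the header is stored as a fixed-width string and the sequence as a Python object so that long sequences are not truncated.

```python
import numpy as np

# Hypothetical layout: one record per sequence, with two named fields.
record_dtype = np.dtype([("header", "U128"), ("cds", "O")])

pairs = [
    (">Hsap|gene1", "ATGGCC"),
    (">Mmus|gene7", "ATGAAATAG"),
]
records = np.array(pairs, dtype=record_dtype)

# Fields can be read as whole columns, or record by record.
all_headers = list(records["header"])
second_sequence = records["cds"][1]
```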

Writing Output: Finally, it writes the collected headers and CDS sequences to a new file in the specified output directory. The output format mimics a FASTA file, with a header line followed by a sequence line.
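The output step might look roughly like this. It is a sketch under assumptions: `write_cds_fasta` is a hypothetical name, and each sequence is written on a single line because the description mentions one header line followed by one sequence line.

```python
from pathlib import Path


def write_cds_fasta(records, out_path):
    """Write (header, sequence) pairs as alternating header and sequence lines."""
    out_path = Path(out_path)
    # Create the output directory if it does not exist yet.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w") as handle:
        for header, sequence in records:
            handle.write(f"{header}\n{sequence}\n")
    return out_path
```

For example, `write_cds_fasta([(">Hsap|gene1", "ATGGCC")], "out/OG0000001.fa")` would produce a two-line file.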

About

This is my repo for DSE 511.
