User info:

Name: Rijan

Fun fact: I have been chased by a king Cobra.

Pet: I had a parakeet named Patty a long, long time ago. It learned how to curse by being around my grandfather.

Favourite classes before joining UTK:

Research credit hours with my Master's PI
Plant Genomics
Quantum Chemistry
Biochemistry
Technical Writing

`cds_to_peptides.py`

The main purpose of this script is to take orthogroup files, which contain lists of related peptide (protein) sequences from different species, and create corresponding "mirror" files containing the CDS (coding DNA sequences) for those same proteins.

In simpler terms, it converts lists of protein identifiers into lists of the DNA sequences that code for them. Paired files like these (peptide orthogroups and their matching CDS orthogroups) are needed for various workflows in evolutionary analysis.

Here is a step-by-step breakdown of its workflow:

Argument Parsing: It starts by defining and parsing command-line arguments (--orthos, --cds, --pickle, --output) so the user can specify the locations for input and output files. It also validates that the provided paths exist.

Loading Data into Memory:

It loads a "species code" dictionary from a pickle file. This dictionary maps short codes (e.g., Hsap) found in sequence headers to a full species name (e.g., homo_sapiens).

It then loads all the CDS sequence data from a directory of pickle files. Each pickle file contains a dictionary mapping sequence headers to their corresponding DNA sequences for a single species. A key performance optimization here is that it loads all of this data into memory at once to avoid slow, repetitive disk access later.

Parallel Processing: The script finds all the orthogroup files in the specified input directory (--orthos) and uses the multiprocessing library to process the files in parallel, spreading the work across the available CPU cores.
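The first three steps can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function names, the `*.pkl` file pattern, the choice to create the output directory rather than require it, and the placeholder worker are all assumptions made for the example.

```python
import argparse
import pickle
from multiprocessing import Pool
from pathlib import Path


def parse_args(argv=None):
    """Parse the four command-line options and check that input paths exist."""
    parser = argparse.ArgumentParser(
        description="Mirror peptide orthogroup files as CDS files."
    )
    parser.add_argument("--orthos", type=Path, required=True,
                        help="directory of peptide orthogroup files")
    parser.add_argument("--cds", type=Path, required=True,
                        help="directory of per-species CDS pickle files")
    parser.add_argument("--pickle", type=Path, required=True,
                        help="pickle file holding the species-code dictionary")
    parser.add_argument("--output", type=Path, required=True,
                        help="directory for the CDS orthogroup files")
    args = parser.parse_args(argv)
    # Assumption: only the inputs must already exist; the output is created later.
    for path in (args.orthos, args.cds, args.pickle):
        if not path.exists():
            parser.error(f"path does not exist: {path}")
    return args


def load_pickle(path):
    """Load one pickled object from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def load_cds(cds_dir):
    """Load every species dictionary up front, keyed by file stem.

    Assumption: one file per species, named <species_name>.pkl.
    """
    return {p.stem: load_pickle(p) for p in sorted(Path(cds_dir).glob("*.pkl"))}


def process_orthogroup(path):
    """Hypothetical placeholder for the per-file work described below."""
    return path.name


def main(argv=None):
    args = parse_args(argv)
    species_codes = load_pickle(args.pickle)  # e.g. {"Hsap": "homo_sapiens"}
    cds_by_species = load_cds(args.cds)       # {species: {header: sequence}}
    args.output.mkdir(parents=True, exist_ok=True)
    ortho_files = sorted(p for p in args.orthos.iterdir() if p.is_file())
    # One task per orthogroup file, spread over the available cores.
    with Pool() as pool:
        done = pool.map(process_orthogroup, ortho_files)
    print(f"loaded {len(species_codes)} species codes, "
          f"{len(cds_by_species)} CDS sets, processed {len(done)} files")
```

In a real script, `main()` would be called under an `if __name__ == "__main__":` guard, which `multiprocessing` requires on platforms that start workers by spawning a fresh interpreter.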

Sequence Translation (for each file):

For a given orthogroup file, it reads all the peptide sequence headers (lines starting with >). For each header, it identifies the species using the "species code" dictionary. It then looks up the correct CDS sequence from the appropriate in-memory species data that was loaded during the Loading Data into Memory step.

The original header and its newly found CDS sequence are stored in a structured NumPy array, which is an efficient way to handle this data.
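As an illustration of this lookup, here is a small sketch. The header layout is an assumption: it treats the text between `>` and the first underscore as the species code, and it assumes the CDS dictionaries are keyed by the header without the leading `>`. The function name `collect_cds` and the field names are hypothetical, not taken from the script.

```python
import numpy as np


def collect_cds(ortho_lines, species_codes, cds_by_species):
    """Pair each peptide header with its CDS in a structured array."""
    records = []
    for line in ortho_lines:
        if not line.startswith(">"):
            continue  # skip the peptide sequence lines themselves
        header = line.strip()
        key = header[1:]                    # assumed dictionary key
        code = key.split("_", 1)[0]         # assumed layout: >Hsap_gene1
        species = species_codes[code]       # e.g. "Hsap" -> "homo_sapiens"
        records.append((header, cds_by_species[species][key]))
    # Object fields hold Python strings of any length.
    dtype = [("header", object), ("cds", object)]
    return np.array(records, dtype=dtype)


# Tiny made-up example data, for illustration only.
codes = {"Hsap": "homo_sapiens"}
cds = {"homo_sapiens": {"Hsap_gene1": "ATGAAATAG"}}
pairs = collect_cds([">Hsap_gene1\n", "MK\n"], codes, cds)
print(pairs["header"][0], pairs["cds"][0])
```

A header whose species code or sequence is missing would raise a `KeyError` in this sketch; how the actual script handles that case is not described here.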

Writing Output: Finally, it writes the collected headers and CDS sequences to a new file in the specified output directory. The output format mimics a FASTA file, with a header line followed by a sequence line.
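The output step might look something like the sketch below. The function name `write_cds_file` is hypothetical, and it assumes the records arrive as (header, sequence) pairs with the `>` already part of the header.

```python
from pathlib import Path


def write_cds_file(records, out_path):
    """Write (header, sequence) pairs as alternating header and sequence lines."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w") as handle:
        for header, sequence in records:
            # One header line followed by one unwrapped sequence line.
            handle.write(f"{header}\n{sequence}\n")
```

Writing each sequence on a single line, rather than wrapping it at a fixed width, is what makes the result FASTA-like rather than strictly conventional FASTA.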
