Commit ecdbc86

Merge pull request #16 from JLSteenwyk/accel
Accel
2 parents 48f8fbe + 6623f93 commit ecdbc86

44 files changed, +1103 -89 lines changed

.github/workflows/ci.yml

Lines changed: 3 additions & 9 deletions
@@ -5,7 +5,7 @@ jobs:
     runs-on: macos-latest
     strategy:
       matrix:
-        python-version: ["3.9", "3.10", "3.11"]
+        python-version: ["3.9", "3.10", "3.11", "3.12", "3.13"]
     steps:
       - uses: actions/checkout@master
       - name: Set up Python ${{ matrix.python-version }}
@@ -17,30 +17,24 @@ jobs:
           pip install -r requirements.txt
           # make orthosnap CLI available for tests
           make install
-          # install test dependencies
-          pip install pytest
-          pip install pytest-cov
       - name: Run tests
         run: |
           make test.fast
   test-full:
     runs-on: macos-latest
     env:
-      PYTHON: "3.10"
+      PYTHON: "3.13"
     steps:
       - uses: actions/checkout@master
       - name: Setup Python
         uses: actions/setup-python@master
         with:
-          python-version: "3.10"
+          python-version: "3.13"
       - name: Install dependencies
         run: |
           pip install -r requirements.txt
           # make orthosnap CLI available for tests
           make install
-          # install test dependencies
-          pip install pytest
-          pip install pytest-cov
       - name: Generate coverage report
         run: |
           make test.coverage

AGENTS.md

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+# Repository Guidelines
+
+## Project Structure & Module Organization
+Core source lives in `orthosnap/`, with the CLI entry point in `orthosnap/orthosnap.py` and shared utilities split across helper modules (for example `helper.py`, `writer.py`). Tests mirror runtime paths under `tests/` (`unit/` for isolated logic, `integration/` for CLI flows) and reuse fixtures in `tests/samples/`. Documentation is authored with Sphinx in `docs/`; build artifacts land in `docs/_build/` during CI. Reusable data for demonstrations ships in `samples/`.
+
+## Build, Test, and Development Commands
+Create an isolated environment (`python -m venv .venv && source .venv/bin/activate`) before installing. Run `make install` to install the package locally so the `orthosnap` console script is available. Use `make test.fast` for the default CI-equivalent quick check, or `make test.unit` and `make test.integration` when focusing on a specific layer. `make test.coverage` produces `unit.coverage.xml` and `integration.coverage.xml` for upload to Codecov; clean any generated FASTA outputs before committing.
+
+## Coding Style & Naming Conventions
+Target Python 3.9–3.13 and follow PEP 8: four-space indentation, snake_case functions, PascalCase classes, and descriptive module-level constants. Keep functions small, prefer explicit variable names (e.g., `taxa_counts` over `tc`), and include short docstrings for public helpers. Align logging and user-visible messages with existing phrasing in `orthosnap/helper.py`. When adding CLI arguments, route parsing through `parser.py` to remain consistent.
+
+## Testing Guidelines
+Pytest drives all suites. Mark unit-only tests with the default markers and integration cases with `@pytest.mark.integration`. Place new unit tests beside the feature under `tests/unit/` and use fixtures from `tests/samples/` where possible. Run `python -m pytest -k <pattern>` to iterate during development, and ensure `make test` succeeds before opening a pull request. Aim to keep coverage steady; add regression tests for every CLI flag or branch condition.
+
+## Commit & Pull Request Guidelines
+Write concise, present-tense commit messages (e.g., `add delimiter argument`) and group related changes. Reference issues in the body using `Fixes #<id>` when relevant. For pull requests, summarise behaviour changes, call out new dependencies, and attach command outputs or screenshots when UX changes. Confirm CI passes, ensure docs stay accurate (update Sphinx sources when CLI flags change), and request at least one review before merging.
+
+## Environment & Documentation Tips
+Use `requirements.txt` for runtime and packaging deps (supports CPython 3.9 through 3.13) and install doc tooling via `pipenv` inside `docs/` when updating the Sphinx site (`cd docs && pipenv run make html`). Avoid checking in files under `docs/_build/` or temporary trees left by integration tests.
+
+## Performance Improvements In Progress
+- Replace whole-tree `copy.deepcopy` calls in `orthosnap/orthosnap.py` with subtree-level cloning so we only touch the relevant clade per iteration.
+- Precompute a parent lookup for tip nodes to let `get_subtree_tips` reuse cached structure instead of copying the tree for every duplicate gene check.
+- Track `assigned_tips` as a set throughout execution to skip repeated list-to-set conversions while filtering already-handled sequences.
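
Reviewer note: the first bullet under "Performance Improvements In Progress" (subtree-level cloning instead of a whole-tree `copy.deepcopy`) can be illustrated with a toy tree. This is a minimal sketch, not the project's code: the `Clade` dataclass and `clone_subtree` are hypothetical stand-ins for a Bio.Phylo clade, assuming only that a clade holds references to its children and none to its parent.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Clade:
    """Hypothetical stand-in for a tree clade: a name plus child clades."""
    name: Optional[str] = None
    clades: List["Clade"] = field(default_factory=list)

    def count_nodes(self) -> int:
        """Count this clade and every descendant."""
        return 1 + sum(child.count_nodes() for child in self.clades)


def clone_subtree(clade: Clade) -> Clade:
    """Deep-copy only the given clade and its descendants."""
    return copy.deepcopy(clade)


# Toy tree: ((A,B),(C,(D,E)))
left = Clade(clades=[Clade("A"), Clade("B")])
right = Clade(clades=[Clade("C"), Clade(clades=[Clade("D"), Clade("E")])])
root = Clade(clades=[left, right])

left_copy = clone_subtree(left)
left_copy.clades[0].name = "A_renamed"  # mutates the clone only

whole_tree_nodes = root.count_nodes()    # what a whole-tree copy would duplicate
subtree_nodes = left_copy.count_nodes()  # what the subtree copy duplicated
```

Because the toy clades carry no parent pointers, copying `left` never reaches the rest of the tree, so the cost scales with the clade of interest rather than with the full phylogeny.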

orthosnap/helper.py

Lines changed: 34 additions & 42 deletions
@@ -7,7 +7,7 @@

 from Bio import Phylo
 from Bio import SeqIO
-from Bio.Phylo.BaseTree import TreeMixin
+from Bio.Phylo.BaseTree import TreeMixin, Tree


 class InparalogToKeep(Enum):
@@ -40,24 +40,10 @@ def determine_if_dups_are_sister(
     # get first set of subtree tips
     # first_set_of_subtree_tips = subtree_tips[0]
     # first_set_of_subtree_tips = subtree_tips
-    # set if duplicate sequences are sister as True
-    are_sisters = True
-    # create a copy of the tree
-    dup_tree = copy.deepcopy(newtree)
+    ancestor = newtree.common_ancestor(subtree_tips)
+    ancestor_tips = {term.name for term in ancestor.get_terminals()}

-    dup_tree = dup_tree.common_ancestor(subtree_tips)
-    _, all_tips = get_all_tips_and_taxa_names(dup_tree, delimiter)
-    if set(all_tips) != set(subtree_tips):
-        are_sisters = False
-
-    # # check if duplicate sequences are sister
-    # for set_of_subtree_tips in subtree_tips[1:]:
-    # if first_set_of_subtree_tips != set_of_subtree_tips:
-    # are_sisters = False
-    # if not are_sisters:
-    # break
-
-    return are_sisters
+    return ancestor_tips == set(subtree_tips)


 def get_all_tips_and_taxa_names(tree, delimiter: str):
@@ -83,6 +69,19 @@ def get_all_tips_and_taxa_names(tree, delimiter: str):
     return taxa, all_tips


+def build_tip_parent_lookup(tree):
+    """Return a mapping from terminal name to its parent clade."""
+
+    tip_parent = dict()
+
+    for parent in tree.find_clades(order="preorder"):
+        for child in parent.clades:
+            if child.is_terminal() and child.name is not None:
+                tip_parent[child.name] = parent
+
+    return tip_parent
+
+
 def check_if_single_copy(taxa: list, all_tips: list):
     """
     check if the input phylogeny is already a single-copy tree
@@ -115,7 +114,7 @@ def get_tips_and_taxa_names_and_taxa_counts_from_subtrees(inter, delimiter: str)
     return taxa_from_terms, terms, counts_of_taxa_from_terms, counts


-def get_subtree_tips(terms: list, name: str, tree):
+def get_subtree_tips(terms: list, name: str, tip_parent_lookup: dict):
     """
     get lists of subsubtrees from subtree
     """
@@ -124,43 +123,36 @@ def get_subtree_tips(terms: list, name: str, tree):
     subtree_tips = []
     # for individual sequence among duplicate sequences
     for dup in dups:
-        # create a copy of the tree
-        temptree = copy.deepcopy(tree)
-        # get the node path for the duplicate sequence
-        node_path = temptree.get_path(dup)
-        # for the terminals of the parent of the duplicate sequence
-        # get the terminal names and append them to temp
-        temp = []
-        for term in node_path[-2].get_terminals():
-            temp.append(term.name)
+        parent = tip_parent_lookup.get(dup)
+        if parent is None:
+            continue
+        temp = [term.name for term in parent.get_terminals()]
         subtree_tips.append(temp)

     return subtree_tips, dups


 def handle_multi_copy_subtree(
-    all_tips: list,
+    subtree,
     terms: list,
-    newtree,
     subgroup_counter: int,
     fasta: str,
     support: float,
     fasta_dict: dict,
-    assigned_tips: list,
+    assigned_tips: set,
     counts_of_taxa_from_terms,
-    tree,
     snap_trees: bool,
     inparalog_to_keep: InparalogToKeep,
     output_path: str,
     inparalog_handling: dict,
     inparalog_handling_summary: dict,
     delimiter: str,
+    tip_parent_lookup: dict,
 ):
     """
     handling case where subtree contains all single copy genes
     """
-    # prune subtree to get subtree of interest
-    newtree = prune_subtree(all_tips, terms, newtree)
+    newtree = Tree(root=copy.deepcopy(subtree))

     # collapse bipartition with low support
     newtree = collapse_low_support_bipartitions(newtree, support)
@@ -171,7 +163,9 @@ def handle_multi_copy_subtree(
         # if the taxon is represented by more than one sequence
         if counts_of_taxa_from_terms[name] > 1:
             # get subtree tips
-            _, dups = get_subtree_tips(terms, name, tree)
+            _, dups = get_subtree_tips(terms, name, tip_parent_lookup)
+            if not dups:
+                continue

             # check if subtrees are sister to one another
             # are_sisters = determine_if_dups_are_sister(subtree_tips)
@@ -217,14 +211,13 @@ def handle_multi_copy_subtree(


 def handle_single_copy_subtree(
-    all_tips: list,
+    subtree,
     terms: list,
-    newtree,
     subgroup_counter: int,
     fasta: str,
     support: float,
     fasta_dict: dict,
-    assigned_tips: list,
+    assigned_tips: set,
     snap_trees: bool,
     output_path: str,
     inparalog_handling: dict,
@@ -233,8 +226,7 @@ def handle_single_copy_subtree(
     """
     handling case where subtree contains all single copy genes
     """
-    # prune subtree to get subtree of interest
-    newtree = prune_subtree(all_tips, terms, newtree)
+    newtree = Tree(root=copy.deepcopy(subtree))

     # collapse bipartition with low support
     newtree = collapse_low_support_bipartitions(newtree, support)
@@ -361,7 +353,7 @@ def write_output_fasta_and_account_for_assigned_tips_single_copy_case(
     subgroup_counter: int,
     terms: list,
     fasta_dict: dict,
-    assigned_tips: list,
+    assigned_tips: set,
     snap_tree: bool,
     newtree,
     output_path: str,
@@ -376,7 +368,7 @@ def write_output_fasta_and_account_for_assigned_tips_single_copy_case(
     with open(output_file_name, "w") as output_handle:
         for term in terms:
             SeqIO.write(fasta_dict[term], output_handle, "fasta")
-            assigned_tips.append(term)
+            assigned_tips.add(term)

     if snap_tree:
         output_file_name = (
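
Reviewer note: the `build_tip_parent_lookup` / `get_subtree_tips` change replaces a per-duplicate tree copy with one traversal and dictionary lookups. The sketch below reproduces that idea without Biopython. The `Clade` class is a hypothetical stand-in that mimics only `clades`, `is_terminal`, `get_terminals`, and a preorder `find_clades` (without the `order` argument), and `sibling_tips` is an illustrative helper, not a function from the commit.

```python
from typing import Dict, Iterator, List, Optional


class Clade:
    """Hypothetical minimal clade; mimics only the methods used below."""

    def __init__(self, name: Optional[str] = None,
                 clades: Optional[List["Clade"]] = None):
        self.name = name
        self.clades = clades or []

    def is_terminal(self) -> bool:
        return not self.clades

    def find_clades(self) -> Iterator["Clade"]:
        """Yield this clade and all descendants in preorder."""
        yield self
        for child in self.clades:
            yield from child.find_clades()

    def get_terminals(self) -> List["Clade"]:
        return [c for c in self.find_clades() if c.is_terminal()]


def build_tip_parent_lookup(root: Clade) -> Dict[str, Clade]:
    """Map each named tip to its parent clade in a single traversal."""
    lookup: Dict[str, Clade] = {}
    for parent in root.find_clades():
        for child in parent.clades:
            if child.is_terminal() and child.name is not None:
                lookup[child.name] = parent
    return lookup


def sibling_tips(tip: str, lookup: Dict[str, Clade]) -> List[str]:
    """Names of all tips under the parent of `tip`; empty if the tip is unknown."""
    parent = lookup.get(tip)
    if parent is None:
        return []
    return [t.name for t in parent.get_terminals()]


# Toy tree: ((sp1|a,sp1|b),(sp2|c,sp3|d))
tree = Clade(clades=[
    Clade(clades=[Clade("sp1|a"), Clade("sp1|b")]),
    Clade(clades=[Clade("sp2|c"), Clade("sp3|d")]),
])
lookup = build_tip_parent_lookup(tree)
pair = sibling_tips("sp1|a", lookup)
```

One design point worth keeping in mind: the lookup stores references into the original tree, so it presumably stays valid only as long as that tree is not restructured after the lookup is built.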

orthosnap/orthosnap.py

Lines changed: 10 additions & 11 deletions
@@ -1,6 +1,5 @@
 #!/usr/bin/env python

-import copy
 import os
 import re
 import sys
@@ -11,6 +10,7 @@
 from .args_processing import process_args
 from .helper import (
     check_if_single_copy,
+    build_tip_parent_lookup,
     get_all_tips_and_taxa_names,
     get_tips_and_taxa_names_and_taxa_counts_from_subtrees,
     handle_single_copy_subtree,
@@ -68,6 +68,8 @@ def execute(
     # read input files and midpoint root tree
     tree, fasta_dict = read_input_files(tree, fasta, rooted)

+    tip_parent_lookup = build_tip_parent_lookup(tree)
+
     # get list of all tip names and taxa names
     taxa, all_tips = get_all_tips_and_taxa_names(tree, delimiter)

@@ -79,7 +81,7 @@ def execute(
     # loop through tree, but skip the root (hence [1:])
     # keep tabs of terms that have already been assigned
     # to a subgroup as well as a counter for that subgroup
-    assigned_tips = []
+    assigned_tips = set()
     subgroup_counter = 0

     inparalog_handling = dict()
@@ -95,8 +97,7 @@ def execute(
             inter, delimiter
         )

-        # create a copy of the input tree
-        newtree = copy.deepcopy(tree)
+        terms_set = set(terms)

         # if a sufficient number of taxa are represented, examine the subtree
         if len(counts_of_taxa_from_terms) >= occupancy:
@@ -105,14 +106,13 @@ def execute(
             # prune tips not part of the subtree of interest
             if (
                 set([1]) == set(counts)
-                and len(list(set(terms) & set(assigned_tips))) == 0
+                and assigned_tips.isdisjoint(terms_set)
             ):
                 subgroup_counter, assigned_tips, \
                     inparalog_handling, inparalog_handling_summary = \
                     handle_single_copy_subtree(
-                        all_tips,
+                        inter,
                         terms,
-                        newtree,
                         subgroup_counter,
                         fasta,
                         support,
@@ -127,26 +127,25 @@ def execute(
             # if any taxon is represented by more than one sequence and
             # the tips have not been assigned to a suborthogroup
             # prune tips not part of the subtree of interest
-            elif len(list(set(terms) & set(assigned_tips))) == 0:
+            elif assigned_tips.isdisjoint(terms_set):
                 subgroup_counter, assigned_tips, \
                     inparalog_handling, inparalog_handling_summary = \
                     handle_multi_copy_subtree(
-                        all_tips,
+                        inter,
                         terms,
-                        newtree,
                         subgroup_counter,
                         fasta,
                         support,
                         fasta_dict,
                         assigned_tips,
                         counts_of_taxa_from_terms,
-                        tree,
                         snap_trees,
                         inparalog_to_keep,
                         output_path,
                         inparalog_handling,
                         inparalog_handling_summary,
                         delimiter,
+                        tip_parent_lookup,
                     )

     if report_inparalog_handling:
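
Reviewer note: the switch from `len(list(set(terms) & set(assigned_tips))) == 0` to `assigned_tips.isdisjoint(terms_set)` keeps the same truth value while avoiding the intermediate set and list. The sketch below checks that equivalence on made-up tip names; `overlaps_old`, `overlaps_new`, and the sample data are illustrative and do not appear in the commit.

```python
def overlaps_old(terms, assigned_tips):
    """Pre-change style: build two sets and a list on every call."""
    return len(list(set(terms) & set(assigned_tips))) != 0


def overlaps_new(terms_set, assigned_tips):
    """Post-change style: assigned_tips is already a set."""
    return not assigned_tips.isdisjoint(terms_set)


# Candidate subtrees, visited in order; a subtree is accepted only if
# none of its tips has already been assigned to an earlier subgroup.
subtrees = [["sp1|a", "sp2|b"], ["sp2|b", "sp3|c"], ["sp4|d"]]

assigned = set()
accepted = []
agreement = []
for terms in subtrees:
    terms_set = set(terms)
    agreement.append(
        overlaps_old(terms, list(assigned)) == overlaps_new(terms_set, assigned)
    )
    if assigned.isdisjoint(terms_set):
        accepted.append(terms)
        assigned.update(terms)
```

`set.isdisjoint` can also return as soon as it finds a shared element, which the intersection-based form cannot do.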

orthosnap/version.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "1.3.1"
+__version__ = "1.4.0"

requirements.txt

Lines changed: 16 additions & 14 deletions
@@ -1,14 +1,16 @@
-biopython==1.82
-coverage==7.3.2
-Cython==3.0.4
-exceptiongroup==1.1.3
-iniconfig==2.0.0
-mock==5.1.0
-numpy==1.24.0
-packaging==23.2
-pluggy==1.5.0
-pytest==8.1.1
-pytest-cov==4.1.0
-pytest-mock==3.0.0
-tomli==2.0.1
-tqdm==4.66.1
+biopython>=1.85
+coverage>=7.5
+Cython>=3.0.10
+exceptiongroup>=1.2.0
+iniconfig>=2.0.0
+mock>=5.1.0
+numpy>=1.24,<2.1; python_version < "3.10"
+numpy>=2.1; python_version >= "3.10"
+packaging>=23.2
+pluggy>=1.5.0
+pytest>=8.2
+pytest-cov>=5.0
+pytest-mock>=3.12
+tomli>=2.0.1; python_version < "3.11"
+tqdm>=4.66.1
+setuptools>=68
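
Reviewer note: the new `requirements.txt` relies on environment markers (`python_version < "3.10"` and similar) to pick a numpy range per interpreter. The sketch below mirrors that selection by hand to show which line applies where. It is not how pip evaluates markers, and `numpy_requirement` and `needs_tomli` are hypothetical helpers written for this note. Versions are compared as integer tuples because a plain string comparison would order "3.9" after "3.10".

```python
import sys
from typing import Tuple


def numpy_requirement(version_info: Tuple[int, int] = sys.version_info[:2]) -> str:
    """Return the numpy specifier from requirements.txt for a given Python."""
    if version_info < (3, 10):
        return "numpy>=1.24,<2.1"
    return "numpy>=2.1"


def needs_tomli(version_info: Tuple[int, int] = sys.version_info[:2]) -> bool:
    """tomli is listed only for interpreters older than 3.11."""
    return version_info < (3, 11)


# Lexicographic string comparison gets the order wrong, hence the tuples.
string_compare_is_wrong = not ("3.9" < "3.10")

selected = {
    (3, 9): numpy_requirement((3, 9)),
    (3, 13): numpy_requirement((3, 13)),
}
```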
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+>Aspergillus_niger_CBS_513.88|XP_001390755.1
+MSNLRDSETSSQISQTLEADVEKCPGQLPSMGQGKVFPPALPNREDYVVDFDGPDDPEHP
+FNWPLPTKLLISMIACFATFIPSFASGVFAPGSEEAAKAFNVGSEVGILGTTLFVLGFAS
+GPVLWAPASELFGRRWPLTLGLLGGAVFTITSAVSKDIQTLIICRFFAGMFGASPLAVVP
+AVLSDIWNNSHRGAAISVYALFVFVGPLSAPFIGGFITASSLGWRWTLYIPAFVGFAWGA
+ISVFFLRETYASCLLLSKAVTLRRLTGNWGVHAKQEEVEVDIQQLIQKYFTRPLRMLVTE
+PIILVISLYMSFIYGIVYALLEAYPYVFESVYGMSMGIDGLPFIGIIIGQLAACGFILSQ
+QSAYVKKLAANNNVPIPEWRLETVVIGAPVFTAGIFWFSWTGFTASIHWMAPTAAGVLIG
+FGILCIFLPCFNYLVDSFLPVAASTVAANIILRSSVAAGFPLFSKQMFANLGVQWAGTLL
+GCLSAIMIPIPFLFRAYGPRLRGEGRQNSR
+>Aspergillus_awamori_IFM_58123|GCB21319.1
+MSNLRDSETSSQISQTLEADVEKCPGQLPSMGQGKVFPPALPNREDYVVDFDGPDDPEHP
+FNWPLSTKLLISTIACFATFIPSFASGIFAPGAEEAAKAFNVGSEVGILGTTLFVLGFAS
+GPVLWAPASELFGRRWPLTLGLLGGAVFTIASAVSKDIQTLIICRFFAGMFGASPLAVVP
+AVLSDIWNNSHRGAAISVYALFVFVGPLSAPFIGGFITASSLGWRWTLYIPAFVGFAWGA
+ISVFFLRETYASCLLLSKAATLRRLTGNWGVHAKQEEVEVDIQQLIQKYFTRPLRMLVTE
+PIILVISLYMSFIYGIVYALLEAYPYVFESVYGMSMGINSLPFIGIIIGQLAACGFILSQ
+QSAYVKKLVANNNVPIPEWRLEIVVIGAPVFTAGIFWFSWTGFTASIHWMAPTAAGVLIG
+AASTVAANIILRSSVAAGFPLFSKQMFANLGVQWAGTLLGCLSAIMIPIPFLFRAYGPRL
+RGEGRQNSR
+>Aspergillus_oerlinghausenensis_CBS139183|A_oerling_CBS139183_05698-RA
+MESPKLYESSTPLSQMSNSKENIRPFNPAQEITDSLQHQLVVGSSHGSTKKDNRPLTIGA
+GKSFPPPLVDPDNYVVEFDGPDDPEHPYNWSYFLKLRISAIACLGTLTASFTSAIFAPGA
+SHASKAFGVGLEVGTLGTTLYILGFASGPLIWAPTSELIGRRWPLTIGMLGVAIFTISAA
+VCKDIQTLIICRFFAGLFGASQLSVVPAVLSDVYNSSQRGTAITIYSLTVFGGPFSAPFI
+GGFISSSSLGWRWTLYISAFMGFASVSLFILLLKETYAPLLLVAKAAAIRQQTFNWGIHA
+KLDEVKVDVHELFHKYFTRPLKILITEPIVLLISLYMSFIYGLVYALLEAYPYVFESVYG
+MSPGVGGLPFIALIIGQLLACGFIITQNSSYAKKLAANGNVAVPEWRLSPAIVGAPVFTV
+GIFWFGWTSFTSRLHWMAPTAAGVLIGFGVLCIFLPCFNYLVDSYLPLAASTVAANIILR
+SAVAAGFPLFSKQMFERLGVQWAATLLGCLAATMIPIPLVFRAFGPLLRGKSRSAP
