Commit eb65182

Generalized medoid trees.
1 parent 880a833 commit eb65182

File tree

9 files changed

+192
-75
lines changed

.github/workflows/main.yml

Lines changed: 1 addition & 1 deletion
@@ -135,7 +135,7 @@ jobs:
 
 - name: medoid + ${{matrix.tree}} (non-default params)
 run: |
-./famsa -medoidtree -gt ${{matrix.tree}} -gt_export -subtree_size 10 -sample_size 100 -cluster_fraction 0.2 -cluster_iters 1 ${INPUT} medoid-${{matrix.tree}}-params.dnd
+./famsa -medoidtree -gt ${{matrix.tree}} -gt_export -subtree_size 10 -sample_size 100 -medoid_threshold 100 -cluster_fraction 0.2 -cluster_iters 1 ${INPUT} medoid-${{matrix.tree}}-params.dnd
 cmp medoid-${{matrix.tree}}-params.dnd ${REF_DIR}/medoid-${{matrix.tree}}-params.dnd
 
 ########################################################################################

.github/workflows/self-hosted.yml

Lines changed: 1 addition & 1 deletion
@@ -272,7 +272,7 @@ jobs:
 
 - name: medoid + ${{matrix.tree}} (non-default params)
 run: |
-${{matrix.wsl}} ./famsa.${{matrix.compiler}} -medoidtree -gt ${{matrix.tree}} -gt_export -subtree_size 10 -sample_size 100 -cluster_fraction 0.2 -cluster_iters 1 ./test/${{matrix.INPUT}}/${{matrix.INPUT}} medoid-${{matrix.tree}}-params.dnd
+${{matrix.wsl}} ./famsa.${{matrix.compiler}} -medoidtree -gt ${{matrix.tree}} -gt_export -subtree_size 10 -sample_size 100 -medoid_threshold 100 -cluster_fraction 0.2 -cluster_iters 1 ./test/${{matrix.INPUT}}/${{matrix.INPUT}} medoid-${{matrix.tree}}-params.dnd
 ${{matrix.wsl}} diff --strip-trailing-cr medoid-${{matrix.tree}}-params.dnd ./test/${{matrix.INPUT}}/medoid-${{matrix.tree}}-params.dnd
 
 ########################################################################################

README.md

Lines changed: 12 additions & 12 deletions
@@ -15,7 +15,7 @@
 [![PyPI](https://img.shields.io/pypi/v/pyfamsa?label=PyFAMSA)](https://pypi.org/project/pyfamsa)
 
 FAMSA2 is a progressive algorithm for large-scale multiple sequence alignments:
-* the entire Pfam-A v37.1 (~22 thousand families, ~62 million sequences) was analyzed in 8 hours,
+* the entire Pfam-A v37.0 (~22 thousand families, ~62 million sequences) was analyzed in 8 hours,
 * the family PF00005 of 3 million ABC transporters was aligned in 5 minutes and 18 GB of RAM.
 
 ## Overview and features
@@ -114,9 +114,9 @@ Options:
 * `-dist_export` - export a distance matrix to output file in CSV format
 * `-square_matrix` - generate a square distance matrix instead of a default triangle
 * `-pid` - calculate percent identity (the number of matching residues divided by the shorter sequence length) instead of distance
-* `-keep-duplicates` - keep duplicated sequences during alignment (default: disabled - duplicates are removed prior and restored after the alignment)
+* `-keep_duplicates` - keep duplicated sequences during alignment (default: disabled - duplicates are removed prior and restored after the alignment)
 * `-gz` - enable gzipped output (default: disabled)
-* `-gz-lev <value>` - gzip compression level [0-9] (default: 7)
+* `-gz_lev <value>` - gzip compression level [0-9] (default: 7)
 * `-trim_columns <fraction>` - remove columns with less than `fraction` of non-gap characters
 * `-refine_mode <on | off | auto>` - refinement mode (default: `auto` - the refinement is enabled for sets <= 1000 seq.)

@@ -147,7 +147,7 @@ The major algorithmic features in FAMSA are:
 * The new heuristic based on K-Medoid clustering for generating fast guide trees. Medoid trees can be calculated in *O*(*N* log*N*) time and work with all types of subtrees (single linkage, UPGMA, NJ). The heuristic can be enabled with `-medoidtree` switch and allow aligning millions of sequences in minutes.
 
 ## Experimental results
-The analysis was performed on our extHomFam v37.1 benchmark produced by combining Homstrad references with Pfam v37.1 families (see Datasets section). The following algorithms were investigated:
+The analysis was performed on our extHomFam v37.0 benchmark produced by combining Homstrad references with Pfam v37.0 families (see Data sets section). The following algorithms were investigated:
 
 | Name | Version | Command line |
 |---|---|---|
@@ -157,23 +157,23 @@ The analysis was performed on our extHomFam v37.1 benchmark produced by combinin
 | Muscle5 | 5.3 | `./muscle -super5 <input> -output <output> --threads 32` |
 | T-Coffee regressive | 13.46.0.919e8c6b | `./clustalo --threads=32 -i <input> --guidetree-out <guide_tree> --force -o /dev/null`<br>`./t_coffee -reg -reg_method mafftsparsecore_msa -reg_tree <guide_tree> -seq <input> -reg_nseq 100 -reg_thread 32 -outfile <output>` |
 | FAMSA | 1.1 | `./famsa -t 32 <input> <output>` |
-| FAMSA 2 | 2.4.1 | `./famsa -t 32 -gz <input> <output>` |
-| FAMSA 2 Medoid | 2.4.1 | `./famsa -t 32 -medoidtree -gz <input> <output>` |
+| FAMSA2 | 2.4.1 | `./famsa -t 32 -gz <input> <output>` |
+| FAMSA2 Medoid | 2.4.1 | `./famsa -t 32 -medoidtree -gz <input> <output>` |
 
 
-The tests were performed with 32 computing threads on a machine with AMD Epyc 9554 CPU and 1152 GiB (approx. 1237 GB) of RAM. We measured a fraction of properly aligned residue pairs and columns (SP and TC scores, respectively) as well as a total running time and a peak memory usage. The results are presented in the figure below. Notches at boxplots indicate 95% confidence interval for median, triangle represent means. FAMSA 2 alignments were stored in gzip format (`-gz` switch).
+The tests were performed with 32 computing threads on a machine with AMD Epyc 9554 CPU and 1152 GiB (approx. 1237 GB) of RAM. We measured a fraction of properly aligned residue pairs and columns (SP and TC scores, respectively) as well as a total running time and a peak memory usage. The results are presented in the figure below. Notches at boxplots indicate 95% confidence interval for median, triangle represent means. FAMSA2 alignments were stored in gzip format (`-gz` switch).
 
 ![extHomFam-SP-comparison](./img/extHomFam.png)
 
 
+## Data sets
 
+Data sets developed and used in the FAMSA2 study:
+* extHomFam v37.0: [https://doi.org/10.5281/zenodo.6524236](https://doi.org/10.5281/zenodo.6524236)
 
-## Datasets
-
-Benchmark data sets developed and used in the FAMSA study:
-* extHomFam: [https://doi.org/10.7910/DVN/BO2SVW](https://doi.org/10.7910/DVN/BO2SVW)
+Older data sets:
 * extHomFam 2: [https://zenodo.org/record/6524237](https://zenodo.org/record/6524237)
-* extHomFam v37.1:
+* extHomFam: [https://doi.org/10.7910/DVN/BO2SVW](https://doi.org/10.7910/DVN/BO2SVW)
 
 ## Citing
 [Deorowicz, S., Debudaj-Grabysz, A., Gudyś, A. (2016) FAMSA: Fast and accurate multiple sequence alignment of huge protein families.

src/core/params.cpp

Lines changed: 7 additions & 5 deletions
@@ -200,11 +200,13 @@ bool CParams::parse(int argc, char** argv, bool& showExpert)
 gt_heuristic = GT::ClusterTree;
 }
 
-findOption(params, "-medoid_threshold", heuristic_threshold);
-findOption(params, "-subtree_size", subtree_size);
-findOption(params, "-sample_size", sample_size);
-findOption(params, "-cluster_fraction", cluster_fraction);
-findOption(params, "-cluster_iters", cluster_iters);
+findOption(params, "-medoid_threshold", medoid.threshold);
+findOption(params, "-subtree_size", medoid.subtree_size);
+findOption(params, "-sample_size", medoid.sample_size);
+findOption(params, "-num_evals", medoid.num_evaluations);
+findOption(params, "-cluster_fraction", medoid.cluster_fraction);
+findOption(params, "-cluster_iters", medoid.cluster_iters);
+
 
 export_tree = findSwitch(params, "-gt_export");
 export_distances = findSwitch(params, "-dist_export");

src/core/params.h

Lines changed: 12 additions & 5 deletions
@@ -82,13 +82,20 @@ class CParams
 GT::Heuristic gt_heuristic = GT::None;
 // Distance distance = Distance::indel_div_lcs;
 Distance distance = Distance::indel075_div_lcs;
-int heuristic_threshold = 0;
 
 int guide_tree_seed = 0;
-int subtree_size = 100;
-int sample_size = 2000;
-float cluster_fraction = 0.1f;
-int cluster_iters = 2;
+
+struct {
+int subtree_size = 100;
+int sample_size = 2000;
+int num_evaluations = 1;
+int threshold = 2000;
+
+float cluster_fraction = 0.1f;
+int cluster_iters = 2;
+
+} medoid;
+
 
 string guide_tree_in_file;
 bool export_distances = false;

src/core/version.h

Lines changed: 6 additions & 2 deletions
@@ -9,14 +9,18 @@ Authors: Sebastian Deorowicz, Agnieszka Debudaj-Grabysz, Adam Gudys
 #ifndef _VERSION_H
 #define _VERSION_H
 
-#define FAMSA_VER "2.4.1"
-#define FAMSA_DATE "2025-05-09"
+#define FAMSA_VER "2.5.0"
+#define FAMSA_DATE "2025-07-07"
 #define FAMSA_AUTHORS "S. Deorowicz, A. Debudaj-Grabysz, A. Gudys"
 
 #endif
 
 /*
 Version history:
+
+2.5.0 (2025-07-07)
+- Updated seed selection in medoid clustering.
+
 2.4.1 (2025-05-09)
 - Added AVX512 support.
 - Added gzipped inputs support.

src/msa.cpp

Lines changed: 12 additions & 8 deletions
@@ -83,7 +83,7 @@ void CFAMSA::initScoreMatrix()
 void CFAMSA::adjustParams(int n_seqs)
 {
 
-if ((params.gt_heuristic != GT::None) && (n_seqs < params.sample_size)) {
+if ((params.gt_heuristic != GT::None) && (n_seqs < params.medoid.threshold)) {
 params.gt_heuristic = GT::None;
 }
 

@@ -179,7 +179,7 @@ std::shared_ptr<AbstractTreeGenerator> CFAMSA::createTreeGenerator(const CParams
 // part-tree versus medoid-tree
 shared_ptr<IClustering> clustering = (params.gt_heuristic == GT::PartTree)
 ? nullptr
-: make_shared<CLARANS>(params.cluster_fraction, params.cluster_iters);
+: make_shared<CLARANS>(params.medoid.cluster_fraction, params.medoid.cluster_iters);
 
 // local seed dumper
 class SeedDumper : public IFastTreeObserver {
@@ -205,9 +205,11 @@ std::shared_ptr<AbstractTreeGenerator> CFAMSA::createTreeGenerator(const CParams
 params.n_threads,
 params.instruction_set,
 dynamic_pointer_cast<IPartialGenerator>(gen),
-params.subtree_size,
-clustering,
-params.sample_size);
+params.medoid.subtree_size,
+params.medoid.sample_size,
+params.medoid.num_evaluations,
+params.medoid.threshold,
+clustering);
 
 gen = ft;
 

@@ -220,9 +222,11 @@ std::shared_ptr<AbstractTreeGenerator> CFAMSA::createTreeGenerator(const CParams
 params.n_threads,
 params.instruction_set,
 dynamic_pointer_cast<IPartialGenerator>(gen),
-params.subtree_size,
-clustering,
-params.sample_size);
+params.medoid.subtree_size,
+params.medoid.sample_size,
+params.medoid.num_evaluations,
+params.medoid.threshold,
+clustering);
 
 gen = ft;
 
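Note on the `adjustParams` change: the fallback to a full guide tree is now keyed on `medoid.threshold` (default 2000) rather than on `sample_size`. The sketch below is a minimal, self-contained illustration of that condition only. `Heuristic`, `MedoidParams`, and `effectiveHeuristic` are hypothetical stand-ins and not names from the FAMSA sources; the field names and default values are copied from the `params.h` hunk.

```cpp
#include <cassert>

// Hypothetical stand-in for GT::Heuristic; only the enumerator names
// (None, PartTree, ClusterTree) appear in the diff.
enum class Heuristic { None, PartTree, ClusterTree };

// Hypothetical named version of the anonymous `medoid` struct in params.h.
// Field names and defaults are taken from the commit.
struct MedoidParams {
    int subtree_size = 100;
    int sample_size = 2000;
    int num_evaluations = 1;
    int threshold = 2000;

    float cluster_fraction = 0.1f;
    int cluster_iters = 2;
};

// Mirrors the condition in CFAMSA::adjustParams after this commit:
// a requested heuristic is dropped when the input has fewer sequences
// than medoid.threshold.
Heuristic effectiveHeuristic(Heuristic requested, int n_seqs, const MedoidParams& medoid) {
    assert(n_seqs >= 0);  // assumption: sequence counts are non-negative
    if (requested != Heuristic::None && n_seqs < medoid.threshold) {
        return Heuristic::None;
    }
    return requested;
}
```

Under this reading, a run with `-sample_size 100` but the default threshold would disable the heuristic for any input below 2000 sequences, which may be why both workflow commands now also pass `-medoid_threshold 100`; the commit itself does not state the reason.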
