This repository provides tools to:
-
Generate Integrated Consensus (
integrate_consensus.py)
Produces a high-quality viral consensus by strategically using ABACAS sequences to fill missing regions in the mapping consensus. It employs a sliding-window approach that verifies the evolutionary plausibility of ABACAS content against empirical priors before incorporation. -
Build Evolutionary Priors (
build_priors.py)
Constructs empirical prior distributions from large multiple-sequence alignments. These priors model expected genetic variation across genomic windows and provide likelihood thresholds for quality control during consensus integration. -
Access Supporting Utilities (
utilsscripts)
Provides modular helper functions for alignment processing, window scoring, and consensus construction used by both main workflows.
pip install priorcons# Create priors
priorcons build-priors --input sequences.fasta --ref REF_ID --output priors.parquet
# Run consensus integration
priorcons integrate-consensus --input alignment.aln --ref REF_ID --prior priors.parquet --output_dir results
This is the entrypoint of the tool. It creates a integrated consensus sequence by combining mapping consensus and ABACAS output, both aligned to a reference sequence, but only after performing quality control (QC) at the window level.
--inputโ path to an alignment file (.aln) containing at least:- 1ยบ Reference sequence
- 2ยบ Mapping consensus sequence
- 3ยบ ABACAS consensus sequence
The sequences in the alignment file must be provided in the specified order, as they will be identified by their position.
-
--refโ ID of the reference sequence in the alignment. -
--priorโ path to a priors table (.parquet) generated withbuild_priors.py. -
--output_dirโ directory to save the results.
- Start with mapping consensus as the baseline
- Identify missing/unreliable regions in mapping consensus
- For each window:
- If mapping has coverage โ keep mapping sequence
- If mapping has missing data โ evaluate ABACAS for that window:
- Check fragmentation and quality
- Verify evolutionary plausibility using priors (nLL score)
- If ABACAS passes QC โ use ABACAS to fill missing regions
- Construct final consensus combining mapping baseline with validated ABACAS fills
- Restore mapping-specific insertions
- QC reporting: compute coverage, substitutions, and insertion metrics comparing the final integrated consensus to MAPPING.
The script produces three files inside --output_dir:
-
Integrated consensus FASTA
- File:
<basename>-INTEGRATED.fasta - Contains the final consensus sequence after merging and reinserting insertions.
- File:
-
Window QC trace (CSV)
- File:
windows_trace.csv - One row per window, recording:
start,endโ genomic coordinates.MISSING_MAPPING,MISSING_ABACASโ counts of missing bases.ABACAS_MORE_INFOโ whether ABACAS has fewer missing bases than MAPPING.ABACAS_FRAGMENTSโ fragmentation level of ABACAS in this window (keep: 0 < n fragments < 3 ).WINDOW_PRIOR_nLL_p95โ threshold from priors.WINDOW_SCORE_nLLโ score of ABACAS in this window.WINDOW_QC_PASSEDโ True/False decision.
- File:
-
Consensus QC summary (JSON)
- File:
qc.json - Provides overall metrics comparing the MAPPING consensus and the integrated consensus:
MAPPING_COVERAGEโ % of genome covered in MAPPING.FINAL_COVERAGEโ % of genome covered in integrated consensus.MAPPING_SUBSTITUTIONSโ substitutions vs. reference in MAPPING.FINAL_SUBSTITUTIONSโ substitutions vs. reference in integrated consensus.EXPECTED_SUBSTITUTIONSโ expected number of substitutions, extrapolated from mapping.OBS-EXP_SUBSTITUTIONSโ difference between observed and expected substitutions.N_INSERTIONSโ number of insertions added back.TOTAL_INSERTIONS_LENGTHโ total inserted length.INSERTIONSโ list of insertions with their coordinates.
- File:
python integrate_consensus.py \
--input /path/to/<sample_name>.aln \
--ref RSV_BD \
--prior /path/to/RSVBD_win100_ovlp50_priors.parquet \
--output_dir resultsThis will generate:
results/<sample_name>-INTEGRATED.fastaresults/windows_trace.csvresults/qc.json
This script creates empirical priors (overlapped windows) from a large multiple sequence alignment.
These priors are later used by integrate_consensus.py to evaluate windows.
-i / --inputโ aligned FASTA file with multiple sequences.-r / --refโ ID of the reference sequence.-o / --outputโ output file (.parquet).--winโ window size (default: 100).--overlapโ overlap size (default: 10).
python build_priors.py \
-i alignment.fasta \
-r ReferenceID \
-o priors.parquet \
--win 100 \
--overlap 10A .parquet file with one row per window, containing:
start,endโ window coordinates.nLL_p95,nLL_p99โ empirical thresholds.profileโ base probability distributions for each position in the window.
For each window of size W bases (e.g., W = 100), and for each position j within that window, we compute the probability of observing each nucleotide:
Where:
= number of sequences with base
at position
.
= pseudocount (Laplace smoothing, default
) to avoid zero probabilities.
- Bases
Nare ignored in the counts.
This gives a per-position categorical distribution.
Given a query sequence , we compute its probability under the window profile.
For each valid (non-N) position with observed base
:
The normalized negative log-likelihood (nLL) is:
Where:
Smaller nLL values indicate sequences more likely under the empirical profile.
To characterize "normal variation" for each window:
- Score all sequences from the alignment against the window profile.
- Collect the distribution of nLL values.
- Extract percentiles (e.g., 95th and 99th) to serve as thresholds.
Thus, for each window we store:
- The distribution (profile).
- Empirical thresholds:
nLL_p95andnLL_p99.
A new sequence can later be compared:
- If
nLL < nLL_p95โ typical. - If
nLL > nLL_p99โ unusually variable, possibly unreliable region.
Several utility scripts provide reusable functions for both processes:
-
utils.py โ basic alignment and scoring functions:
load_alignment,extract_ref_positions,sliding_windows,score_window.
-
utils_integrate_consensus.py โ additional helpers for consensus integration:
- missingness and fragmentation counts,
- insertion handling,
- QC calculations,
- consensus merging,
- window evaluation wrapper.
These modular functions keep the pipeline clean and reusable.
This tool allows also an analyis QC.
The input is a directory with all the results folders (one for each sample) there are stored the qc files:
- input: priorcons_path / / qc_files
It also need a gtf file and an outdir.
priorcons qc --input_dir /path/to/results/PRIORCONS/ \
--gff_file /path/to/rsv.gff \
--output_dir /path/to/output_dir_plotshttps://zenodo.org/records/17454552/files/PriorCons_Test_data.zip?download=1