Code repository associated with Brierley et al., 'An AI for AI..'

This repository contains supporting data, code, and result files associated with Brierley et al. (2025), "An AI for an AI: identifying zoonotic potential of avian influenza viruses via genomic machine learning."

This work takes a large set of avian influenza virus genome sequences from both avian and human hosts (indicative of zoonotic spillover), clusters them, calculates various genomic and proteomic feature representations, and trains supervised machine learning models to identify zoonotic sequences.

Additional large data files containing the genome sequence feature sets and final trained stack ensemble models are available at . These files should nest within the folders herein when downloaded.

scripts/

data_scripts/

00_startup_data_process.R sets options for sequence data processing, and calls the remaining R scripts in turn.
01_functions.R defines custom functions for data processing and cleaning.
02_process_GISAID_NCBI_data.R filters and cleans sequence data and corresponding metadata.
03_calc_feats.R calculates features from sequences:
- overlapping k-mers of segment nucleotide sequences (2 ≤ k ≤ 6);
- genome composition of coding sequences (nucleotide bias, dinucleotide bias, Relative Synonymous Codon Usage, amino acid bias);
- pre-calculated protein sequence features, which this script processes rather than computes (see protein_feat_extract.py below).
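
As an illustration of the overlapping k-mer features listed above, the following is a minimal Python sketch, not the code in 03_calc_feats.R. The function names, the per-k normalisation to frequencies, and the choice to skip windows containing ambiguous bases are all assumptions made for the example.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count overlapping k-mers (window moves one base at a time)."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_frequencies(seq, k_min=2, k_max=6):
    """Return k-mer frequencies for every k in [k_min, k_max].

    Frequencies are normalised within each value of k (an assumption).
    """
    feats = {}
    for k in range(k_min, k_max + 1):
        counts = kmer_counts(seq, k)
        total = sum(counts.values())
        for kmer, n in counts.items():
            feats[kmer] = n / total
    return feats


features = kmer_frequencies("ATGATGCCA")
```

Each segment sequence would yield one such feature dictionary; k-mers of different lengths never share a key, so a single dictionary can hold all values of k.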
04_cluster_seqs.R calls a separate installation of MMseqs2 to cluster whole-genome sequences by shared identity.
05_process_clusts.R formats clustering outputs from MMseqs2 and reselects cluster representatives for mixed-label clusters.
protein_feat_extract.py calls iFeatureOmega-CLI to calculate functional physicochemistry measures of protein sequences (Conjoint Triad, Composition-Transition-Distribution, Pseudo-Amino Acid Composition) according to the parameters in protein_params.json.
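
To give a sense of what a Conjoint Triad descriptor measures, here is a small standalone Python sketch. It does not call iFeatureOmega-CLI and is not taken from protein_feat_extract.py. The seven-class amino acid grouping is the one commonly used for this descriptor and is an assumption here, as is returning plain frequencies; the tool's own grouping and normalisation may differ.

```python
from collections import Counter
from itertools import product

# Assumed seven-class amino acid grouping for the Conjoint Triad descriptor.
CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {aa: str(i + 1) for i, group in enumerate(CLASSES) for aa in group}


def conjoint_triad(protein):
    """Return the frequency of each of the 343 class triads.

    Triads are read from overlapping windows of three residues.
    """
    classes = [AA_TO_CLASS.get(aa) for aa in protein.upper()]
    counts = Counter()
    for i in range(len(classes) - 2):
        window = classes[i:i + 3]
        if None not in window:  # skip windows with non-standard residues
            counts["".join(window)] += 1
    total = sum(counts.values())
    keys = ["".join(t) for t in product("1234567", repeat=3)]
    return {k: (counts[k] / total if total else 0.0) for k in keys}


triads = conjoint_triad("AGVRK")
```

Every protein maps to a fixed-length vector of 343 values regardless of its length, which is what makes the descriptor usable as a model input.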

ml_scripts/

The 01a-01f R scripts construct individual machine learning models (12 feature sets × 8 genes/proteins × 13 holdout sets) to predict zoonotic status using five different binary classification algorithms. Model training is parallelised by default.
The exception is 01e_create_training_fold_indices.R, which extracts the previously defined cross-validation folds used during parameter optimisation so that they can be supplied to XGBoost separately in 01f_build_xgb.R.
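
The size of the model grid follows directly from the counts given above: 12 feature sets, 8 genes/proteins and 13 holdout sets. The Python sketch below enumerates that grid; it is illustrative only, and the labels are hypothetical placeholders because this README does not list the real feature set, gene or holdout names.

```python
from itertools import product

# Counts come from the README; the labels themselves are hypothetical.
feature_sets = [f"featset_{i:02d}" for i in range(1, 13)]
genes = [f"gene_{i}" for i in range(1, 9)]
holdouts = [f"holdout_{i:02d}" for i in range(1, 14)]


def model_grid(feature_sets, genes, holdouts):
    """List every feature set / gene / holdout combination to be modelled."""
    return [
        {"feature_set": f, "gene": g, "holdout": h}
        for f, g, h in product(feature_sets, genes, holdouts)
    ]


grid = model_grid(feature_sets, genes, holdouts)
print(len(grid))  # 1248 combinations per algorithm
```

If each of the five algorithms is fitted to every combination, that gives 6240 model fits in total, which is why parallel execution matters. A flat list like this is also convenient to hand to a worker pool, one combination per task.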
02_evaluate_validate.R loads all individual models and evaluates performance on holdout subtypes.
03_stack_weighted_model.R constructs a LASSO logistic regression "stack", or meta-learner model, which takes the outputs of the best individual models as new input features, and then evaluates performance on holdout subtypes.
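
The idea of a LASSO logistic regression meta-learner can be sketched in plain Python as below. This is not the repository's implementation, which is written in R. The proximal gradient fitting routine, the penalty strength, the learning rate, and the toy base-model probabilities are all assumptions chosen to keep the example self-contained.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(value, threshold):
    """Shrink a coefficient towards zero; small coefficients become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(X, y, lam=0.05, lr=0.5, n_iter=2000):
    """Fit L1-penalised logistic regression by proximal gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        resid = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) - yi
                 for x, yi in zip(X, y)]
        grad_b = sum(resid) / n
        grad_w = [sum(r * x[j] for r, x in zip(resid, X)) / n for j in range(p)]
        b -= lr * grad_b  # the intercept is not penalised
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


def predict_proba(w, b, x):
    """Stacked probability for one row of base-model predictions."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical predicted probabilities from three base models (columns A, B, C).
base_preds = [
    [0.90, 0.80, 0.50], [0.80, 0.70, 0.40], [0.70, 0.90, 0.60], [0.85, 0.60, 0.50],
    [0.20, 0.30, 0.50], [0.10, 0.40, 0.60], [0.30, 0.20, 0.40], [0.15, 0.35, 0.50],
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

weights, intercept = fit_lasso_logistic(base_preds, labels)
```

The L1 penalty tends to shrink the weights of uninformative base models towards zero, so the fitted weights indicate which individual models the stack relies on.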
04_stack_weighted_varimp.R calculates variable importance by permuting, one at a time, each raw feature used by the models within the stack. This can require lengthy computation and is best split into a batch process along the vector varnames.
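
Permutation importance and the suggested batching can be sketched as follows. This Python example is an illustration under stated assumptions, not the R code in 04_stack_weighted_varimp.R: the model is treated as a black-box callable, accuracy is used as the performance measure, and the function names, repeat count and batch size are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, varname, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling one feature across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[varname] for r in rows]
        rng.shuffle(shuffled)
        permuted = [dict(r, **{varname: v}) for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats


def batches(varnames, size):
    """Split the feature names into chunks that can run as separate jobs."""
    return [varnames[i:i + size] for i in range(0, len(varnames), size)]


# Toy model that only looks at feature "a" (hypothetical data).
rows = [{"a": 0.9, "b": 0.1}, {"a": 0.8, "b": 0.7},
        {"a": 0.2, "b": 0.6}, {"a": 0.1, "b": 0.3}]
labels = [1, 1, 0, 0]


def model(row):
    return int(row["a"] > 0.5)


for batch in batches(["a", "b"], size=1):
    for name in batch:
        print(name, permutation_importance(model, rows, labels, name))
```

Because each feature is permuted independently of the others, the batches share no state and can be submitted as separate jobs, with the results concatenated afterwards.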
05_figs_tables.R generates the figures stored in the figures_tables/ folder.

data/

fold_indices_list.rds defines the 5-fold cross-validation data folds, for consistency between model runs.
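
Storing fold membership once and reusing it means every model is tuned on identical splits. The Python sketch below shows the general idea; it does not read fold_indices_list.rds, whose internal structure is not described here. The seeded assignment, the per-row fold labels and the function names are assumptions for illustration.

```python
import random


def assign_folds(n_rows, k=5, seed=1):
    """Assign each row to one of k folds, reproducibly for a given seed."""
    labels = [i % k + 1 for i in range(n_rows)]  # near-equal fold sizes
    random.Random(seed).shuffle(labels)
    return labels


def fold_index_lists(fold_labels):
    """Map each fold label to (training row indices, held-out row indices)."""
    folds = {}
    for label in sorted(set(fold_labels)):
        held_out = [i for i, f in enumerate(fold_labels) if f == label]
        train = [i for i, f in enumerate(fold_labels) if f != label]
        folds[label] = (train, held_out)
    return folds


fold_labels = assign_folds(23)
folds = fold_index_lists(fold_labels)
```

Different modelling libraries expect folds in different shapes (some take training indices, others held-out indices), so keeping both lists per fold makes it simple to pass the same splits to each of them.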
allflu_wgs_ref.csv defines ID, source, host label, and date for all sequences considered for analysis.
holout_clusters/ contains cluster members and cluster representative IDs selected when sequences were clustered with different parameter sets and excluding different holdout subtypes.

analysis/

Analytical result outputs describing:
- grid search optimisation of individual model hyperparameters;
- performance of individual models on held-out test subtypes;
- performance of stack ensemble models on held-out test subtypes;
- permutation variable importance of genomic and proteomic features.

figures_tables/

Contains the figures used in the manuscript.

lbrierley/ai_for_ai
