This repository contains supporting data, code, and result files associated with Brierley et al. (2025), "An AI for an AI: identifying zoonotic potential of avian influenza viruses via genomic machine learning."
This work takes a large set of avian influenza virus genome sequences from both avian and human hosts (indicative of zoonotic spillover), clusters them, calculates various genomic and proteomic feature representations, and trains supervised machine learning models to identify zoonotic sequences.
Additional large data files containing the genome sequence feature sets and final trained stack ensemble models are available at . These files should nest within the folders herein when downloaded.
- `00_startup_data_process.R` sets options for sequence data processing, and calls the remaining R scripts in turn.
- `01_functions.R` defines custom functions for data processing and cleaning.
- `02_process_GISAID_NCBI_data.R` filters and cleans sequence data and corresponding metadata.
- `03_calc_feats.R` calculates features from sequences:
  - overlapping k-mers of segment nucleotide sequence (2 ≤ k ≤ 6);
  - genome composition of coding sequences (nucleotide bias, dinucleotide bias, Relative Synonymous Codon Usage, amino acid bias);
  - processes pre-calculated protein feature sequences (see below).
- `04_cluster_seqs.R` calls a separate installation of MMseqs2 to cluster whole genome sequences by shared identity.
- `05_process_clusts.R` formats clustering outputs from MMseqs2 and reselects cluster representatives for mixed-label clusters.
- `protein_feat_extract.py` calls iFeatureOmega-CLI to calculate functional physicochemistry measures of protein sequences (Conjoint Triad, Composition-Transition-Distribution, Pseudo-Amino Acid Composition) according to parameters of `protein_params.json`.
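
As an illustration of the overlapping k-mer features listed above, here is a minimal Python sketch. It is not the code in `03_calc_feats.R`. The function name `kmer_frequencies`, the example sequence, the choice to skip windows containing ambiguous bases, and the normalisation to relative frequencies are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k):
    """Return the relative frequency of every overlapping k-mer over A/C/G/T."""
    seq = seq.upper()
    # Slide a window of width k one base at a time (overlapping k-mers).
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    # Assumption: windows containing ambiguous bases (e.g. N) are skipped.
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    # Report all 4**k possible k-mers so every sequence yields the same columns.
    all_kmers = ("".join(p) for p in product("ACGT", repeat=k))
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in all_kmers}


# Hypothetical toy sequence; real inputs would be segment nucleotide sequences.
features = {}
for k in range(2, 7):  # 2 <= k <= 6, as stated above
    features.update(kmer_frequencies("ATGGCGTACGATGNCA", k))
```

Emitting every possible k-mer, including those with zero counts, keeps the feature columns aligned across sequences, which is convenient when the result is used as a model input table.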
- `01a`-`01f` R scripts construct individual machine learning models (12 feature sets * 8 gene/proteins * 13 holdout sets) to predict zoonotic status using five different binary classification algorithms, which are parallelised by default.
  - The exception is `01e_create_training_fold_indices.R`, which extracts the previously defined folds for cross-validation during parameter optimisation, to supply to XGBoost separately in `01f_build_xgb.R`.
- `02_evaluate_validate.R` loads all individual models and evaluates performance on holdout subtypes.
- `03_stack_weighted_model.R` constructs a LASSO logistic regression "stack", or meta-learner model, using inputs from the best individual models as new features, before evaluating performance on holdout subtypes.
- `04_stack_weighted_varimp.R` calculates variable importance by permuting each raw feature used within models within the stack one-by-one; note this can require lengthy computation and is best split into a batch process along the vector `varnames`.
- `05_figs_tables.R` generates the figures in folder "figures_tables".
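
The permutation variable importance step can be sketched generically as follows. This is a simplified Python illustration, not the procedure in `04_stack_weighted_varimp.R`. The toy model `toy_predict`, the use of accuracy as the metric, the number of repeats, and the toy data are assumptions made for this example.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which predict(row) equals the true label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    # Each pass of this outer loop is independent, so it could be split
    # into separate batch jobs, one per feature or group of features.
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            permuted = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    """Hypothetical stand-in for a fitted model; it only uses feature 0."""
    return int(row[0] > 0.5)


rows = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.9, 0.4], [0.2, 0.1], [0.7, 0.6]]
labels = [0, 1, 0, 1, 0, 1]
importance = permutation_importance(toy_predict, rows, labels)
```

A feature the model never uses, such as feature 1 here, receives an importance of exactly zero because shuffling it cannot change any prediction.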
- `fold_indices_list.rds` defines 5-fold cross-validation data folds for consistency between model runs.
- `allflu_wgs_ref.csv` defines ID, source, host label, and date for all sequences considered for analysis.
- `holout_clusters/` contains cluster members and cluster representative IDs selected when sequences were clustered with different parameter sets and excluding different holdout subtypes.
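
To show why fixed fold indices keep model runs comparable, here is a minimal Python sketch of drawing 5 folds once, saving them, and reloading them for each model. It does not reproduce how `fold_indices_list.rds` was built. The function names, the seed, the sample count, the JSON serialisation (in place of RDS), and the plain random assignment without stratification or cluster awareness are all assumptions made for this example.

```python
import json
import random


def make_fold_indices(n_samples, n_folds=5, seed=1):
    """Assign each sample index to exactly one of n_folds held-out folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices round-robin so fold sizes differ by at most one.
    return [sorted(indices[f::n_folds]) for f in range(n_folds)]


def train_test_split(folds, held_out):
    """Return (train, test) index lists for one cross-validation iteration."""
    test = folds[held_out]
    train = sorted(i for f, fold in enumerate(folds) if f != held_out for i in fold)
    return train, test


folds = make_fold_indices(n_samples=23)
# Serialise once and reload for every model, rather than re-drawing folds.
saved = json.dumps(folds)
reloaded = json.loads(saved)
train, test = train_test_split(reloaded, held_out=0)
```

Because every algorithm reads the same stored indices, differences in cross-validated performance reflect the models rather than differences in how the data happened to be partitioned.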
Analytical result outputs describing:
- grid search optimisation of individual model hyperparameters
- performance of individual models on held out test subtypes
- performance of stack ensemble models on held out test subtypes
- permutation variable importance of genomic and proteomic features
Contains figures as used in manuscript.