This repository contains an implementation of a k-mer-based method for Genome-Wide Association Studies (GWAS) in complex polyploid organisms (e.g., sugarcane, potato, sweetpotato, alfalfa). The approach is equally applicable to diploid species. By leveraging k-mer abundance profiles and statistical modeling, the method identifies associations between genetic variants and phenotypic traits.
- Enhanced Genetic Variability Detection: KMERIA can capture a wider range of genetic variants, including structural variations and copy number variations, which are often overlooked in traditional GWAS.
- Independent of Reference Genomes: KMERIA does not rely on a reference genome to identify genotypes, making it suitable for organisms with complex and variable genomic architectures, such as auto-polyploids.
- Improved Additive Effect Estimation: The analysis of k-mer copy number can provide more efficient estimates of additive effects in auto-polyploid species, allowing for better interpretation of genotype-phenotype relationships.
- Facilitated Genotype Identification: KMERIA reduces the complexity of identifying genotypes in polyploids, facilitating faster and more efficient association analyses.
- KMERIA Version 2.0.1 (2025.10.30):
  - K-mer matrix construction is now more efficient and consumes fewer resources;
  - Updated the filter step to use the new compressed output format;
  - Enhanced the m2b step with BGZF compression and statistics;
  - Updated the association step to use our newly implemented association tool, bimbamAsso.
- KMERIA Version 0.0.1 (2024.10.14) is no longer maintained.
- C/C++ compiler
- GNU make
- Linux system
# Clone the KMERIA repository:
git clone https://github.com/Sh1ne111/KMERIA.git
# To avoid GNU C++ Runtime Library conflicts, you can create a conda virtual environment to ensure all dependent libraries are installed correctly.
conda env create -f kmeria_env.yml
conda activate kmeriaenv
# htslib
export LD_LIBRARY_PATH=/your_path/KMERIA/lib:$LD_LIBRARY_PATH
# Change Permissions
chmod 755 /your_path/KMERIA/bin/*
chmod 755 /your_path/KMERIA/external_tools/*
chmod 755 /your_path/KMERIA/bimbamAsso/*
# Add the KMERIA directories to PATH
export PATH=/your_path/KMERIA/bin:/your_path/KMERIA/bimbamAsso:/your_path/KMERIA/external_tools:$PATH
# For source code installations
# cd /your_path/KMERIA/
# make && make install
# make clean
KMERIA provides a wrapper script, kmeria_wrapper.pl, that generates job scripts for the entire analysis pipeline, with built-in support for the SLURM, SGE, and PBS schedulers. We strongly recommend using this script as the entry point for running a complete KMERIA analysis.
# Optional flags used below:
#   --use-kmc           count k-mers with KMC (default: kmeria count)
#   --use-bimbam-tools  use the built-in 'bimbamAsso' instead of 'gemma'
# Comments must not follow a trailing backslash, or the command is cut short.
perl /KMERIA/scripts/kmeria_wrapper.pl --step all \
    --input /path/to/fastq_files \
    --output /path/to/kmeria_results \
    --samples sample.list \
    --threads 32 \
    --kmer 31 \
    --min-abund 5 \
    --max-abund 1000 \
    --batch-size 2 \
    --use-kmc \
    --kmc-memory 32 \
    --ploidy 4 \
    --depth-file /path/to/sample_depths.txt \
    --pheno /path/to/phenotypes.txt \
    --pheno-col 1 \
    --use-bimbam-tools \
    --scheduler slurm \
    --queue hebhcnormal01
For detailed, step-by-step instructions, parameter explanations, and advanced usage, please visit our comprehensive KMERIA Wiki.
- Pipeline (Easy Mode): Detailed breakdown of the kmeria_wrapper.pl parameters.
- Detailed Step-by-Step Tutorial: A complete walkthrough of the entire KMERIA workflow, from raw reads to association results.
- Post-GWAS Analysis: Guides on mapping associated k-mers and reads.
- Retrieve k-mer dosage: Retrieve k-mer dosage from the k-mer counting matrices.
#===============================================================================#
# #
# _ ____ __ ______ _____ _____ #
# | |/ / \/ | ____| __ \|_ _| /\ #
# | ' /| \ / | |__ | |__) | | | / \ #
# | < | |\/| | __| | _ / | | / /\ \ #
# | . \| | | | |____| | \ \ _| |_ / ____ \ #
# |_|\_|_| |_|______|_| \_\_____/_/ \_\ #
# #
#===============================================================================#
Program: KMERIA - A KMER-based genome-wIde Association testing approach
for polyploids
Version: v2.0.1 (2025-10-14)
Author: Chen Shuai <[email protected]>
GitHub: https://github.com/Sh1ne111/KMERIA
Usage: kmeria <command> [options]
Commands:
Data Processing:
count Count k-mers from FASTA/FASTQ files
dump Convert binary k-mer file to plain text
kctm Build population k-mer counting matrix
filter Filter k-mer matrix by frequency and quality
Format Conversion:
m2b Convert k-mer matrix to BIMBAM dosage format
b2g Convert BIMBAM format to genotype format
Analysis:
sketch Random sampling for PCA and kinship calculation
asso Conduct k-mer genome-wide association study
Utilities:
fkr Fetch reads associated k-mers from FASTQ files
fkrtgs Fetch reads associated k-mers from TGS FASTQ files
kbam Extract reads associated k-mers from BAM files
addp Annotate BAM with association p-values
Additional Help:
kmeria <command> -h Show detailed help for specific command
Visit https://github.com/Sh1ne111/KMERIA for documentation
#===========================================================================#
# Citation: If you use KMERIA, please cite our paper at [Journal/DOI] #
#===========================================================================#
KMERIA also includes several utility scripts located in the /bin and /scripts directories:
- /bin/retrieve_kmer: Get k-mer dosage from filtered k-mer counting matrices.
- /scripts/calc_gwas_threshold_new.R: Calculate the GWAS significance threshold.
- /scripts/plot_manhattan.R: Helper script for plotting Manhattan plots.
Usage instructions are available on the Wiki.
For questions or feedback, please contact Chen Shuai at [email protected].
Should I use kmeria count or KMC?
Use kmeria count (default) for:
- Most standard analyses
- Direct KMERIA pipeline integration
Use KMC (--use-kmc) for:
- Very large datasets (>100GB per sample)
- When you need strict abundance filtering
- Compatibility with other KMC-based workflows
- Faster k-mer counting
When choosing the k-mer length (--kmer), consider:
- Shorter k-mers: More sensitive, more false positives, less memory
- Longer k-mers: More specific, fewer false positives, more memory
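One way to make the sensitivity side of this trade-off concrete is to count how many k-mers a single read yields and how many of them one mismatched base can alter. The sketch below is plain arithmetic and does not call KMERIA; the 150 bp read length and the helper name kmers_per_read are assumptions chosen for illustration.

```shell
# Number of k-mers in a read of length L is L - k + 1.
# Hypothetical helper, not part of KMERIA.
kmers_per_read() { echo $(( $1 - $2 + 1 )); }

# Assumed read length of 150 bp, for illustration only.
read_len=150
for k in 21 31 51; do
    total=$(kmers_per_read "$read_len" "$k")
    # A base in the interior of the read lies in k overlapping k-mers,
    # so one mismatch there changes up to k of them.
    echo "k=$k: $total k-mers per read, up to $k changed by one mismatched base"
done
```

With longer k-mers, a larger share of each read's k-mers is affected by a single sequencing error or variant, which is one reason shorter k-mers tend to be more sensitive.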
How do I process paired-end reads?
Both counting methods (kmeria count and KMC) automatically detect and process paired-end files that follow either naming pattern:
- Files matching: sample_R1.fq.gz and sample_R2.fq.gz
- Or: sample_1.fq.gz and sample_2.fq.gz
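Before launching a run, it can help to confirm that every sample has both mates under one of these naming patterns. The following is a minimal sketch of such a check; it is not KMERIA's own detection logic, and the function name find_pairs is hypothetical.

```shell
# Hypothetical pre-flight check: list samples in a directory that have both
# mates under either naming pattern, and warn about unpaired R1 files.
# Prints: sample<TAB>mate1<TAB>mate2
find_pairs() (
    dir=$1
    for r1 in "$dir"/*_R1.fq.gz "$dir"/*_1.fq.gz; do
        [ -e "$r1" ] || continue    # pattern matched nothing
        case "$r1" in
            *_R1.fq.gz) base=${r1%_R1.fq.gz}; r2=${base}_R2.fq.gz ;;
            *_1.fq.gz)  base=${r1%_1.fq.gz};  r2=${base}_2.fq.gz ;;
        esac
        if [ -e "$r2" ]; then
            printf '%s\t%s\t%s\n' "$(basename "$base")" "$r1" "$r2"
        else
            printf 'WARNING: no mate found for %s\n' "$r1" >&2
        fi
    done
)
```

For example, `find_pairs /path/to/fastq_files | cut -f1` prints the names of the samples with complete pairs, which can be compared against the sample list.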
Can I restart a failed pipeline?
Yes. Because each step generates independent job scripts, you can resume from the point of failure:
1. Identify which step failed (check log files)
2. Fix the issue (add memory, correct input files, etc.)
3. Re-run only that specific step: --step count|kctm|filter|m2b|asso
4. Continue with subsequent steps
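The restart procedure can be sketched as a small dry-run helper that lists the failed step and everything after it. This assumes the steps run in the order given for --step (count, kctm, filter, m2b, asso) and that each wrapper call needs the same remaining options as the original run; steps_from is a hypothetical name, and the script only prints commands rather than executing them.

```shell
# Step order as listed for --step (assumed to be the execution order).
STEPS="count kctm filter m2b asso"

# Print the given step and every later step, one per line.
# Returns non-zero if the step name is unknown.
steps_from() (
    found=0
    for s in $STEPS; do
        if [ "$s" = "$1" ]; then found=1; fi
        if [ "$found" -eq 1 ]; then echo "$s"; fi
    done
    [ "$found" -eq 1 ]
)

# Dry run after a failure in 'filter': print one wrapper call per remaining step.
# Append the --input/--output/... options from the original run before executing.
for step in $(steps_from filter); do
    echo "perl /KMERIA/scripts/kmeria_wrapper.pl --step $step"
done
```

Printing the commands first makes it easy to review which steps will be regenerated before submitting anything to the scheduler.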
How do I speed up association analysis?
The association step handles internal parallelism:
- Use --threads to set concurrency (e.g., 64)
- Ensure fast I/O (SSD storage)
- Pre-compute kinship and covariates
Pass --use-bimbam-tools to run the association with the built-in bimbamAsso instead of gemma.
If you use KMERIA in your research, please cite:
https://github.com/Sh1ne111/KMERIA
Shuai Chen et al. A k-mer-based GWAS approach empowering gene mining in polyploids, 05 November 2025, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-7347406/v1]
