KMERIA

A KMER-based genome-wIde Association testing approach for polyploids

Introduction

This repository contains an implementation of a k-mer-based method for genome-wide association studies (GWAS) in complex polyploid organisms (e.g., sugarcane, potato, sweetpotato, and alfalfa). The approach is equally applicable to diploid species. By combining k-mer abundance profiles with statistical modeling, the method identifies associations between genetic variants and phenotypic traits.
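To make "k-mer abundance profile" concrete, here is a toy sketch that counts every k-mer in a single sequence. It is an illustration only, not KMERIA's implementation: the helper name `count_kmers` is hypothetical, and it ignores reverse complements, base qualities, and the scale of real sequencing data.

```shell
# Toy illustration only: count the k-mers of one sequence.
# count_kmers is a hypothetical helper, not part of KMERIA.
count_kmers() {
  local seq=$1 k=$2
  awk -v s="$seq" -v k="$k" 'BEGIN {
    # Slide a window of width k across the sequence and tally each k-mer.
    for (i = 1; i + k - 1 <= length(s); i++) n[substr(s, i, k)]++
    for (m in n) print m, n[m]
  }' | LC_ALL=C sort
}

# Prints one "k-mer count" pair per line, sorted by k-mer.
count_kmers ACACGT 2
```

Repeating this for every sample and lining up the counts per k-mer gives the kind of population-level matrix that the later steps operate on.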

Features

  • Enhanced Genetic Variability Detection: KMERIA can capture a wider range of genetic variants, including structural variations and copy number variations, which are often overlooked in traditional GWAS.

  • Independent of Reference Genomes: KMERIA does not rely on a reference genome to identify genotypes, making it suitable for organisms with complex and variable genomic architectures, such as auto-polyploids.

  • Improved Additive Effect Estimation: The analysis of k-mer copy number can provide more efficient estimates of additive effects in auto-polyploid species, allowing for better interpretation of genotype-phenotype relationships.

  • Facilitated Genotype Identification: KMERIA reduces the complexity of identifying genotypes in polyploids, enabling faster and more efficient association analyses.

Recent updates

  • KMERIA Version 2.0.1 (2025.10.30):

    • K-mer matrix construction is now more efficient and consumes fewer resources;
    • Updated filter step to use new compressed output format;
    • Enhanced m2b step with BGZF compression and statistics;
    • Updated the association step to use our newly implemented association tool, bimbamAsso.
  • KMERIA Version 0.0.1 (2024.10.14) is no longer maintained.

Prerequisites

  • C/C++ compiler
  • GNU make
  • Linux system

Installation

   
   # Clone the KMERIA repository:
   git clone https://github.com/Sh1ne111/KMERIA.git

   # To avoid GNU C++ Runtime Library conflicts, you can create a conda virtual environment to ensure all dependent libraries are installed correctly.
   conda env create -f kmeria_env.yml
   conda activate kmeriaenv 

   # Make the bundled htslib library visible at run time
   export LD_LIBRARY_PATH=/your_path/KMERIA/lib:$LD_LIBRARY_PATH

   # Change Permissions
   chmod 755 /your_path/KMERIA/bin/*
   chmod 755 /your_path/KMERIA/external_tools/*
   chmod 755 /your_path/KMERIA/bimbamAsso/*

   # Add the KMERIA tools to PATH
   export PATH=/your_path/KMERIA/bin:/your_path/KMERIA/bimbamAsso:/your_path/KMERIA/external_tools:$PATH

   
   # For source code installations
   # cd /your_path/KMERIA/
   # make && make install
   # make clean
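After exporting `PATH`, a quick check that the shell can actually find the executable saves a failed job submission later. The sketch below is an assumption-laden convenience, not part of KMERIA: `missing_tools` is a hypothetical helper, and only the `kmeria` command name is taken from this README.

```shell
# Report which of the given commands are not on PATH.
# missing_tools is a hypothetical helper, not part of KMERIA.
missing_tools() {
  local status=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "not found on PATH: $tool" >&2
      status=1
    fi
  done
  return "$status"
}

# 'kmeria' is the main executable named in the Command Overview.
missing_tools kmeria || echo "re-check the 'export PATH' line above"
```

Pass any additional executable names you rely on as extra arguments.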

Quick Start

KMERIA provides a wrapper script, kmeria_wrapper.pl, designed to generate job scripts for the entire analysis pipeline, with built-in support for SLURM, SGE, and PBS schedulers. To facilitate the execution of a complete KMERIA analysis, we strongly recommend using this script as the entry point for workflow management.

# Optional flags used in this example:
#   --use-kmc           count k-mers with KMC (default: kmeria count)
#   --use-bimbam-tools  use the built-in 'bimbamAsso' instead of 'gemma'
perl /your_path/KMERIA/scripts/kmeria_wrapper.pl --step all \
  --input /path/to/fastq_files \
  --output /path/to/kmeria_results \
  --samples sample.list \
  --threads 32 \
  --kmer 31 \
  --min-abund 5 \
  --max-abund 1000 \
  --batch-size 2 \
  --use-kmc \
  --kmc-memory 32 \
  --ploidy 4 \
  --depth-file /path/to/sample_depths.txt \
  --pheno /path/to/phenotypes.txt \
  --pheno-col 1 \
  --use-bimbam-tools \
  --scheduler slurm \
  --queue hebhcnormal01

➡️ Full Pipeline and Documentation

For detailed, step-by-step instructions, parameter explanations, and advanced usage, please visit our comprehensive KMERIA Wiki.

Command Overview

#===============================================================================#
#                                                                               #
#                 _  ____  __ ______ _____  _____                               #
#                | |/ /  \/  |  ____|  __ \|_   _|   /\                         #
#                | ' /| \  / | |__  | |__) | | |    /  \                        #
#                |  < | |\/| |  __| |  _  /  | |   / /\ \                       #
#                | . \| |  | | |____| | \ \ _| |_ / ____ \                      #
#                |_|\_|_|  |_|______|_|  \_\_____/_/    \_\                     #
#                                                                               #
#===============================================================================#

Program:  KMERIA - A KMER-based genome-wIde Association testing approach
          for polyploids

Version:  v2.0.1 (2025-10-14)
Author:   Chen Shuai <[email protected]>
GitHub:   https://github.com/Sh1ne111/KMERIA

Usage:    kmeria <command> [options]

Commands:

  Data Processing:
    count      Count k-mers from FASTA/FASTQ files
    dump       Convert binary k-mer file to plain text
    kctm       Build population k-mer counting matrix
    filter     Filter k-mer matrix by frequency and quality

  Format Conversion:
    m2b        Convert k-mer matrix to BIMBAM dosage format
    b2g        Convert BIMBAM format to genotype format

  Analysis:
    sketch     Random sampling for PCA and kinship calculation
    asso       Conduct k-mer genome-wide association study

  Utilities:
    fkr        Fetch reads associated k-mers from FASTQ files
    fkrtgs     Fetch reads associated k-mers from TGS FASTQ files
    kbam       Extract reads associated k-mers from BAM files
    addp       Annotate BAM with association p-values

Additional Help:
    kmeria <command> -h     Show detailed help for specific command
    Visit https://github.com/Sh1ne111/KMERIA for documentation

#===========================================================================#
#  Citation: If you use KMERIA, please cite our paper at [Journal/DOI]      #
#===========================================================================#

Miscellaneous Tools

KMERIA also includes several utility scripts located in the /bin and /scripts directories:

  • /bin/retrieve_kmer: Get k-mer dosage from filtered k-mer counting matrices.
  • /scripts/calc_gwas_threshold_new.R: Calculate the GWAS significance threshold.
  • /scripts/plot_manhattan.R: Helper script for plotting Manhattan plots.

Usage instructions are available on the Wiki.
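For orientation on what a GWAS significance threshold can look like, one widely used convention is the Bonferroni correction: divide the significance level by the number of tests and report it on the -log10 scale. The sketch below shows only that convention; it is not necessarily what calc_gwas_threshold_new.R computes, and the helper name and input numbers are illustrative assumptions.

```shell
# Bonferroni threshold on the -log10 scale: -log10(alpha / n_tests).
# bonferroni_neglog10 is a hypothetical helper, not a KMERIA script.
bonferroni_neglog10() {
  LC_ALL=C awk -v a="$1" -v n="$2" 'BEGIN { printf "%.2f\n", -log(a / n) / log(10) }'
}

# alpha = 0.05 across one million tested k-mers (made-up numbers)
bonferroni_neglog10 0.05 1000000   # prints 7.30
```

Because neighbouring k-mers are highly correlated, Bonferroni on the raw k-mer count tends to be conservative; treat it as an upper bound rather than a recommendation.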

Contact

For questions or feedback, please contact Chen Shuai at [email protected].

FAQs

Should I use kmeria count or KMC?
    Use kmeria count (default) for:
            - Most standard analyses
            - Direct KMERIA pipeline integration
    Use KMC (--use-kmc) for:
            - Very large datasets (>100 GB per sample)
            - Strict abundance filtering
            - Compatibility with other KMC-based workflows
            - Faster counting
    When choosing the k-mer length (--kmer), consider:
            - Shorter k-mers: more sensitive, more false positives, less memory
            - Longer k-mers: more specific, fewer false positives, more memory

How do I process paired-end reads?
    Both methods automatically detect and process paired-end files: 
            - Files matching: sample_R1.fq.gz and sample_R2.fq.gz 
            - Or: sample_1.fq.gz and sample_2.fq.gz
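If you want to confirm before submitting jobs that your files follow one of these two naming patterns, a small check like the following can help. It is a sketch only: `mate_of` is a hypothetical helper that mirrors the patterns listed above, not KMERIA's own detection code.

```shell
# Given a read-1 file name, print the expected read-2 file name.
# mate_of is a hypothetical helper, not part of KMERIA.
mate_of() {
  case $1 in
    *_R1.fq.gz) echo "${1%_R1.fq.gz}_R2.fq.gz" ;;
    *_1.fq.gz)  echo "${1%_1.fq.gz}_2.fq.gz" ;;
    *)          echo "unrecognised read-1 name: $1" >&2; return 1 ;;
  esac
}

mate_of sampleA_R1.fq.gz   # prints sampleA_R2.fq.gz
mate_of sampleB_1.fq.gz    # prints sampleB_2.fq.gz
```

Looping over the read-1 files in your input directory and testing `[ -e "$(mate_of "$f")" ]` flags samples with a missing or misnamed mate.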

Can I restart a failed pipeline?
    Yes! Since each step generates independent job scripts: 
    1. Identify which step failed (check log files) 
    2. Fix the issue (add memory, correct input files, etc.) 
    3. Re-run only that specific step: --step count|kctm|filter|m2b|asso
    4. Continue with subsequent steps
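The restart procedure above amounts to "re-run the failed step, then everything after it". A minimal sketch of that ordering, assuming the five step names listed in step 3 run in that order; `steps_from` is a hypothetical helper and does not call the wrapper itself.

```shell
# Print the failed step and every step after it, in pipeline order.
# steps_from is a hypothetical helper; the step names are the
# --step choices listed above.
steps_from() {
  local found=0 step
  for step in count kctm filter m2b asso; do
    if [ "$step" = "$1" ]; then found=1; fi
    if [ "$found" -eq 1 ]; then echo "$step"; fi
  done
  [ "$found" -eq 1 ]   # non-zero exit status if the name is not a known step
}

# Prints filter, m2b, asso (one per line)
steps_from filter
```

Each printed name can then be passed to kmeria_wrapper.pl as `--step <name>`, together with the same input, output, and sample arguments used in the original run.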

How do I speed up association analysis?
    The association step parallelizes internally:
     - Use --threads to set concurrency (e.g., 64)
     - Ensure fast I/O (SSD storage)
     - Pre-compute kinship and covariates
    To run the association with bimbamAsso, pass --use-bimbam-tools.

Citation

If you use KMERIA in your research, please cite:

https://github.com/Sh1ne111/KMERIA

Shuai Chen et al. A k-mer-based GWAS approach empowering gene mining in polyploids, 05 November 2025, PREPRINT (Version 1) available at Research Square [https://doi.org/10.21203/rs.3.rs-7347406/v1]