Detailed Step by Step Tutorial
Chen Shuai edited this page Nov 12, 2025 · 8 revisions
Building k-mer count matrices from resequencing reads
First, choose the k-mer size. The k-mer length must not exceed 31 bp and must be the same for all individuals. We recommend 31-mers. The raw reads should also pass quality control before counting, and the input files should be organized consistently for downstream analysis.
# k-mer counting (the default k is 21, so pass -k 31 explicitly for 31-mers)
kmeria count -k 31 sample1_r1/r2.fastq.gz -t 4 -C -o sample1_k31.bin
...
kmeria count -k 31 sampleN_r1/r2.fastq.gz -t 4 -C -o sampleN_k31.bin
Document:
Usage: kmeria count [options] <input.fa|fa.gz|fq|fq.gz>
Options:
-k INT Length of k-mer [2-31] [default: 21]
-t INT Thread count [default: 4]
-C Count strands separately (no canonical)
-H FILE Histogram output file
-o FILE Result output [default: stdout]
-T Text output (instead of binary)
-p INT Partitioning bits [default: 16]
Advanced:
-c Compress homopolymers (experimental)
Examples:
kmeria count -k 21 -t 8 input.fq.gz -o results.bin
kmeria count -k 21 -C input.fq.gz -o results.bin
kmeria count -k 21 -H hist.txt input.fq.gz -o results.bin
kmeria count -k 21 -T input.fq.gz -o results.txt
- KMERIA also accepts k-mer count databases produced by KMC (https://github.com/refresh-bio/KMC):
ls sampleN_*.fastq.gz > sampleN_input.txt
kmc -k31 -t4 -m32 -b -ci5 -cs1000 @sampleN_input.txt sampleN_k31 .
kmc_tools transform sampleN_k31 sort sampleN_sort_k31
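With many samples, the per-sample list files and KMC commands above can be generated in a loop. The following is a minimal dry-run sketch, assuming the reads sit in one directory and are named `<sample>_r1.fastq.gz` and `<sample>_r2.fastq.gz` (a hypothetical layout; adjust the patterns to your own naming). The helper name `make_kmc_jobs` is illustrative only. It writes one `<sample>_input.txt` per sample and prints the commands instead of running them.

```shell
# Dry-run sketch: write one KMC input list per sample and print the commands.
# Assumption: reads are named <sample>_r1.fastq.gz and <sample>_r2.fastq.gz.

make_kmc_jobs() {
    local read_dir=$1   # directory holding the FASTQ files
    local r1 sample
    for r1 in "$read_dir"/*_r1.fastq.gz; do
        [ -e "$r1" ] || continue                 # pattern matched nothing: skip
        sample=$(basename "$r1" _r1.fastq.gz)    # strip the mate suffix
        # List both mates of this sample, one path per line.
        printf '%s\n' "$read_dir/${sample}"_r[12].fastq.gz > "${sample}_input.txt"
        # Same KMC options as in the tutorial; printed, not executed.
        echo "kmc -k31 -t4 -m32 -b -ci5 -cs1000 @${sample}_input.txt ${sample}_k31 ."
        echo "kmc_tools transform ${sample}_k31 sort ${sample}_sort_k31"
    done
}
```

Calling `make_kmc_jobs reads > kmc_jobs.sh` leaves the list files in the current directory and a script of commands that can be reviewed before it is run.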
Usage: kmeria kctm [options] -i <input_file> -o <output_prefix>
Supported formats:
- KMC databases (sorted KMC3 format)
- KMERIA binary files (kdump format)
- Mixed: can combine both formats in one matrix
Required arguments:
-i, --input FILE File containing paths to databases (one per line)
-o, --output PREFIX Output file prefix
Output format options:
-t, --text Output text format (default: binary format)
-d, --delimiter STR Field delimiter for text format (default: tab)
--no-header Do not include header row in text format
Performance and memory options:
-j, --threads N Number of worker threads (default: auto-detect)
-b, --batch-size N Batch processing size (default: 1000)
-s, --block-size N K-mers per output file (default: 10000000)
--buffer-size N Write buffer threshold (default: 100000)
--queue-depth N Max queue depth for backpressure (default: 5)
--memory-efficient Enable memory-efficient mode (default: on)
Compression options (binary format only):
--no-compression Disable compression (default: enabled)
Note: Compression can reduce file size by 30-90%
Filtering options:
-m, --min-freq N Minimum frequency threshold (default: 1)
Other options:
-v, --verbose Verbose output mode with memory stats
-h, --help Show this help information
=== CONVERSION MODE ===
Convert binary k-mer matrix files to text format
Usage: kmeria kctm --convert [options]
Single file conversion:
-i, --input FILE Input binary file
-o, --output FILE Output text file
Batch conversion:
--input-dir DIR Input directory with binary files
--output-dir DIR Output directory for text files
-t, --threads N Parallel threads for batch mode (default: 1)
Examples:
# Build matrix from mixed formats
echo "sample1.kmc_pre" > databases.txt
echo "sample2.bin" >> databases.txt
kmeria kctm -i databases.txt -o output_matrix
# Convert binary to text
kmeria kctm --convert -i matrix.0001.bin -o matrix.0001.txt -v
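For more than a handful of samples, the list passed to `kmeria kctm -i` can be assembled from a directory instead of typed by hand. The sketch below assumes that KMC databases are referenced by their `.kmc_pre` file (as in the example above) and that KMERIA count files end in `.bin`; the function name `write_db_list` is hypothetical. Keep matrix outputs such as `matrix.0001.bin` in a different directory so they are not picked up.

```shell
# Sketch: collect sorted KMC databases and KMERIA binary count files into the
# one-path-per-line list that `kmeria kctm -i` reads.
# Assumptions: KMC databases are listed by their .kmc_pre file and KMERIA
# count files use the .bin extension.

write_db_list() {
    local db_dir=$1 list=$2
    local f
    : > "$list"                                   # start with an empty list
    for f in "$db_dir"/*.kmc_pre "$db_dir"/*.bin; do
        [ -e "$f" ] || continue                   # skip patterns with no match
        printf '%s\n' "$f" >> "$list"
    done
    # Report how many databases were listed (to stderr, not into the list).
    echo "$(wc -l < "$list" | tr -d ' ') databases written to $list" >&2
}
```

After `write_db_list counts databases.txt`, the list can be passed on as `kmeria kctm -i databases.txt -o output_matrix`.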
# @Prog: K-mer Matrix Filter with Compression Support
# @Version: v3.0.0
#
# Usage:
# kmeria filter [options]
#
# Required options:
# -i STR Directory of k-mer abundance matrices <input_dir>
# -o STR Output directory <output_dir>
# -d STR Sample depth file <sample_id\tsequence_depth[\tploidy]>
#
# Optional parameters:
# -t INT Number of threads (default: 8)
# -c INT Max abundance of k-mer (default: 1000)
# -s FLOAT Missing ratio threshold (default: 0.8)
# -p INT Genome ploidy (default: 4)
#
# Format control:
# --input-format STR Input format: auto|text|binary|compressed (default: auto)
# --output-format STR Output format: text|binary|compressed (default: text)
#
# Other options:
# -v Verbose output
# -h Show this help message
#
# Format Details:
# auto - Auto-detect input format (recommended)
# text - Plain text tab-delimited format
# binary - Uncompressed binary format
# compressed - Compressed binary format (30-90% space savings)
#
# Examples:
# # Process compressed files, output compressed format (preserves compression)
# kmeria filter -i matrices/ -o filtered/ -d depths.txt --output-format compressed -v
#
# # Process any format, output text
# kmeria filter -i matrices/ -o filtered/ -d depths.txt
#
# # Convert compressed to text while filtering
# kmeria filter -i matrices/ -o filtered/ -d depths.txt --input-format compressed
#
# # Process with custom parameters
# kmeria filter -i matrices/ -o filtered/ -d depths.txt -t 16 -c 500 -s 0.9 --output-format compressed
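The sample depth file passed with `-d` is tab-separated: sample ID, sequencing depth, and an optional ploidy column. A small check before filtering can catch space-separated or malformed lines early. The sketch below assumes depth is a positive number, ploidy (when present) is a positive integer, and the file has no header line. None of these rules are stated in the help text, and `check_depth_file` is a hypothetical helper.

```shell
# Sketch: check that a sample depth file has the layout described for `-d`:
#   sample_id <TAB> sequence_depth [<TAB> ploidy]
# Assumptions: depth is a positive number, ploidy is a positive integer,
# and there is no header line.

check_depth_file() {
    awk -F'\t' '
        NF < 2 || NF > 3 {
            printf "line %d: expected 2 or 3 tab-separated fields, got %d\n", NR, NF
            bad = 1; next
        }
        $2 !~ /^[0-9]+(\.[0-9]+)?$/ || $2 + 0 <= 0 {
            printf "line %d: bad depth \"%s\"\n", NR, $2; bad = 1
        }
        NF == 3 && $3 !~ /^[1-9][0-9]*$/ {
            printf "line %d: bad ploidy \"%s\"\n", NR, $3; bad = 1
        }
        END { exit bad + 0 }    # non-zero exit status if any line failed
    ' "$1"
}
```

For example, `check_depth_file depths.txt && kmeria filter -i matrices/ -o filtered/ -d depths.txt` runs the filter only when the file passes the check.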
# @Prog: K-mer Matrix to BIMBAM Converter
# @Version: v3.0.0
#
# Usage:
# m2b [options]
#
# Required options:
# --in STR Directory of k-mer matrices
# --out STR Output directory for BIMBAM files
#
# Optional parameters:
# --threads INT Number of file processing threads (default: 8)
# --prefix STR Input file prefix filter (default: "filtered_")
#
# Format control:
# --input-format STR Input format: auto|text|binary|compressed (default: auto)
# --allele-info STR Allele information string (default: "X, Y")
#
# Normalization options:
# --no-normalize Disable value normalization (default: enabled with linear)
# --min-range FLOAT Minimum output value (default: 0.0)
# --max-range FLOAT Maximum output value (default: 2.0)
#
# --quantile-norm Enable quantile-based normalization (handles outliers)
# This method converts extreme values to boundaries:
# - Values ≤ lower quantile → min-range (0.0)
# - Values ≥ upper quantile → max-range (2.0)
# - Values in between → linear transformation [0, 2]
#
# --lower-quantile FLOAT Lower quantile threshold (default: 0.05, i.e., 5th percentile)
# Values below this are set to min-range
#
# --upper-quantile FLOAT Upper quantile threshold (default: 0.95, i.e., 95th percentile)
# Values above this are set to max-range
#
#
#
# BGZF Compression options:
# --no-compress Disable compression (default: BGZF enabled)
# --level INT Compression level 0-9 (default: 6)
# 0=no compression, 1=fastest, 9=best compression
# --bgzf-threads INT BGZF compression threads per file (default: 4)
# --buffer-size INT Write buffer size in KB (default: 128)
# Recommended: 64-256 KB for best performance
#
# Other options:
# --verbose Verbose output
# --stats Output detailed statistics
# --help Show this help message
#
# Examples:
# # Basic conversion with linear normalization (default)
# kmeria m2b --in filtered_matrices/ --out bimbam_output/
#
# # Quantile-based normalization (robust to outliers)
# kmeria m2b --in matrices/ --out bimbam/ --quantile-norm
#
# # Custom quantile thresholds (1st and 99th percentiles)
# kmeria m2b --in matrices/ --out bimbam/ --quantile-norm \
# --lower-quantile 0.01 --upper-quantile 0.99
#
# # Quantile normalization with custom output range
# kmeria m2b --in matrices/ --out bimbam/ --quantile-norm \
# --min-range -1.0 --max-range 1.0
#
# # High-throughput with quantile normalization
# kmeria m2b --in large_data/ --out output/ --quantile-norm \
# --threads 16 --bgzf-threads 8 --level 3 --verbose
#
# # No normalization (preserve original values)
# kmeria m2b --in matrices/ --out bimbam/ --no-normalize
#
# # Maximum compression with quantile normalization
# kmeria m2b --in matrices/ --out bimbam/ --quantile-norm --level 9
#
# # No compression (maximum speed) with quantile normalization
# kmeria m2b --in matrices/ --out bimbam/ --quantile-norm --no-compress
#
# Output Format:
# - With BGZF (default): .bimbam.gz files (BGZF format)
# - Without compression: .bimbam files (plain text)
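To make the `--quantile-norm` description concrete, the sketch below applies the same three rules to a plain list of numbers: values at or below the lower quantile map to min-range, values at or above the upper quantile map to max-range, and values in between are rescaled linearly as min + (x - lo) * (max - min) / (hi - lo). It is an illustration, not kmeria's implementation. It assumes one value per line and nearest-rank quantiles, while the quantile definition and the grouping of values (per k-mer or per file) that `m2b` actually uses are not documented here. The function name `quantile_norm` is hypothetical.

```shell
# Sketch of the quantile-based normalization described for --quantile-norm.
# Assumptions: one numeric value per line on stdin and nearest-rank quantiles;
# kmeria's own quantile definition may differ.
# Arguments: lower quantile, upper quantile, min-range, max-range.

quantile_norm() {
    local lower_q=${1:-0.05} upper_q=${2:-0.95}
    local min_range=${3:-0.0} max_range=${4:-2.0}
    local tmp
    tmp=$(mktemp)
    cat > "$tmp"                       # keep the input so it can be read twice
    # First input (sorted values on stdin) gives the two quantile boundaries;
    # second input (the saved file) is rescaled in its original order.
    sort -n "$tmp" | awk -v lq="$lower_q" -v uq="$upper_q" \
        -v lo_out="$min_range" -v hi_out="$max_range" '
        # Nearest-rank position of quantile q among n sorted values.
        function rank(q, n,   r) { r = int(q * n); if (r < q * n) r++; return (r < 1) ? 1 : r }
        NR == FNR { sorted[NR] = $1 + 0; n = NR; next }
        FNR == 1  { lo = sorted[rank(lq, n)]; hi = sorted[rank(uq, n)] }
        {
            x = $1 + 0
            if (x <= lo)      v = lo_out
            else if (x >= hi) v = hi_out
            else              v = lo_out + (x - lo) * (hi_out - lo_out) / (hi - lo)
            printf "%.4f\n", v
        }
    ' - "$tmp"
    rm -f "$tmp"
}
```

For instance, `printf '%s\n' 1 2 3 | quantile_norm 0.05 0.95 -1.0 1.0` maps the three values onto the range [-1, 1], mirroring the custom-range example above.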
# Randomly sample k-mers for PCA and kinship calculation (about 0.1% of all k-mers, roughly 10,000,000).
for i in bimbam/*.bimbam.gz; do kmeria sketch -n 2000 "$i" >> sampling_kmer.bimbam; done
kmeria b2g -i sampling_kmer.bimbam -s <sample_list> -o sampling.geno
The b2g command is useful if you want to compute PCA and kinship from the sampled k-mers with external software such as PLINK or GEMMA.
plink --vcf sampling.geno --make-bed --out sampling.geno
gemma -bfile sampling.geno -gk -p phenotype.tsv -o kinship
# @Prog: KmersGWAS
# @Version: v2.1.0
#
# Usage:
# asso [options]
#
# Required options:
# -i, --input DIR Directory containing BIMBAM format files
# -p, --pheno FILE Phenotype file
# -o, --output DIR Output directory for results
#
# Tool selection:
# --tool STR Tool to use: bimbamAsso | gemma
#
# Optional files:
# -c, --covar FILE Covariate file
# -k, --kinship FILE Kinship/relatedness matrix
# -s, --sample FILE Sample file (for bimbamAsso tool)
# --auto-sample Auto-generate sample file from genotype data
#
# Analysis options:
# -n, --ncol INT Phenotype column number (default: 1)
# -m, --method STR Analysis method for GEMMA: lmm|bslmm|loco (default: lmm)
# --maf FLOAT Minor allele frequency threshold (default: 0.01)
# --miss FLOAT Missing data threshold (default: 0.05)
# --no-kinship Don't use kinship matrix even if provided
#
# Bimbam-specific options:
# --bimbam-gzip Input bimbam files are gzipped
# --gen-kinship Generate kinship matrix before association
# --kin-method INT Kinship method: 1=IBS+mean, 2=IBS+random, 3=BN (default: 3)
# --kin-precision INT Kinship precision digits (default: 10)
# --out-precision INT Output precision digits (default: 5)
# --start-marker INT Start marker index (default: 0)
# --end-marker INT End marker index (default: all)
# --write-eigen Write eigenvalue/eigenvector files
# --disable-gls Disable GLS, use OLS instead
#
# Performance options:
# -t, --threads INT Number of parallel threads (default: 8)
# --dry-run Show commands without executing
#
# Quality control:
# --no-validate Skip input file validation
# --no-check-deps Skip dependency checking
#
# Output options:
# --verbose Verbose output and progress reporting
# --compress Compress output files
# --no-cleanup Keep temporary files
#
# Other options:
# -h, --help Show this help message
#
# Examples:
#
# # Using kmeria association tools (bimbamAsso) with pre-computed kinship
# kmeria asso --tool bimbamAsso -i bimbam_files/ -p pheno.txt \
# -s samples.txt -k kinship.txt --bimbam-gzip -t num_threads -o results/
#
#########################################################################################################################