Detailed Step by Step Tutorial

Chen Shuai edited this page Nov 12, 2025 · 8 revisions

Step by step usage

(1) Count k-mers for each individual separately

K-mer count matrices are built from resequencing reads, so start by counting k-mers in each sample. First decide on the k-mer size: it must not exceed 31 bp and must be the same for all individuals. We recommend 31-mers; because kmeria count defaults to k = 21, pass -k 31 explicitly. Before counting, run quality control on the raw reads and organize the input files for downstream analysis.

# kmer counting
kmeria count sample1_r1/r2.fastq.gz -k 31 -t 4 -C -o sample1_k31.bin
...
kmeria count sampleN_r1/r2.fastq.gz -k 31 -t 4 -C -o sampleN_k31.bin

Documentation:

Usage: kmeria count [options] <input.fa|fa.gz|fq|fq.gz>

Options:
  -k INT     Length of k-mer [2-31] [default: 21]
  -t INT     Thread count [default: 4]
  -C         Count strands separately (no canonical)
  -H FILE    Histogram output file
  -o FILE    Result output [default: stdout]
  -T         Text output (instead of binary)
  -p INT     Partitioning bits [default: 16]

Advanced:
  -c         Compress homopolymers (experimental)

Examples:
  kmeria count -k 21 -t 8 input.fq.gz -o results.bin
  kmeria count -k 21 -C input.fq.gz -o results.bin
  kmeria count -k 21 -H hist.txt input.fq.gz -o results.bin
  kmeria count -k 21 -T input.fq.gz -o results.txt

# Alternative: count k-mers with KMC, then sort the database
ls sampleN_*.fastq.gz > sampleN_input.txt
kmc -k31 -t4 -m32 -b -ci5 -cs1000 @sampleN_input.txt sampleN_k31 .

kmc_tools transform sampleN_k31 sort sampleN_sort_k31
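
The -C option of kmeria count (and the -b option of KMC) switches off canonical counting, which changes what a "k-mer" means in every later step. The following is a minimal Python sketch of the difference, not KMERIA's implementation; the function names `reverse_complement` and `count_kmers` are hypothetical, and skipping windows that contain non-ACGT characters is an assumption made for illustration.

```python
from collections import Counter

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(seq: str, k: int, canonical: bool = True) -> Counter:
    """Count k-mers in seq (hypothetical helper, illustration only).

    canonical=True  -> a k-mer and its reverse complement share one counter,
                       stored under the lexicographically smaller string.
    canonical=False -> each strand is counted separately.
    """
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) - set("ACGT"):
            continue  # assumption: skip windows containing N or other symbols
        if canonical:
            kmer = min(kmer, reverse_complement(kmer))
        counts[kmer] += 1
    return counts


if __name__ == "__main__":
    print(count_kmers("ACGTT", 3, canonical=True))
    print(count_kmers("ACGTT", 3, canonical=False))
```

Whichever mode you choose, use the same one for every sample so that the rows of the population matrix are comparable.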

(2) Create the k-mer count matrices at the population level

Usage: kmeria kctm [options] -i <input_file> -o <output_prefix>

Supported formats:
  - KMC databases (sorted KMC3 format)
  - KMERIA binary files (kdump format)
  - Mixed: can combine both formats in one matrix

Required arguments:
  -i, --input FILE        File containing paths to databases (one per line)
  -o, --output PREFIX     Output file prefix

Output format options:
  -t, --text              Output text format (default: binary format)
  -d, --delimiter STR     Field delimiter for text format (default: tab)
  --no-header             Do not include header row in text format

Performance and memory options:
  -j, --threads N         Number of worker threads (default: auto-detect)
  -b, --batch-size N      Batch processing size (default: 1000)
  -s, --block-size N      K-mers per output file (default: 10000000)
  --buffer-size N         Write buffer threshold (default: 100000)
  --queue-depth N         Max queue depth for backpressure (default: 5)
  --memory-efficient      Enable memory-efficient mode (default: on)

Compression options (binary format only):
  --no-compression        Disable compression (default: enabled)
                          Note: Compression can reduce file size by 30-90%

Filtering options:
  -m, --min-freq N        Minimum frequency threshold (default: 1)

Other options:
  -v, --verbose           Verbose output mode with memory stats
  -h, --help              Show this help information

=== CONVERSION MODE ===
Convert binary k-mer matrix files to text format

Usage: kmeria kctm --convert [options]

Single file conversion:
  -i, --input FILE        Input binary file
  -o, --output FILE       Output text file

Batch conversion:
  --input-dir DIR         Input directory with binary files
  --output-dir DIR        Output directory for text files
  -t, --threads N         Parallel threads for batch mode (default: 1)

Examples:
  # Build matrix from mixed formats
  echo "sample1.kmc_pre" > databases.txt
  echo "sample2.bin" >> databases.txt
  kmeria kctm -i databases.txt -o output_matrix

  # Convert binary to text
  kmeria kctm --convert -i matrix.0001.bin -o matrix.0001.txt -v
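
With many samples, writing the list file for -i by hand is error-prone. Below is a small Python sketch that collects database files from one directory and writes one path per line, as the -i option expects. The helper name `write_database_list` is hypothetical, and the suffixes (.bin for KMERIA binary files, .kmc_pre for KMC databases) are taken from the example above; adjust them if your files are named differently.

```python
from pathlib import Path


def write_database_list(db_dir, list_path, suffixes=(".bin", ".kmc_pre")):
    """Write one database path per line to list_path (hypothetical helper).

    Assumption: databases are identified by file suffix, as in the
    tutorial's databases.txt example.
    """
    db_dir = Path(db_dir)
    paths = sorted(
        p for p in db_dir.iterdir()
        if p.is_file() and p.name.endswith(tuple(suffixes))
    )
    if not paths:
        raise FileNotFoundError(f"no database files found in {db_dir}")
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths


if __name__ == "__main__":
    for path in write_database_list(".", "databases.txt"):
        print(path)
```

Sorting the paths keeps the sample order reproducible between runs, which helps when the same order has to be matched later by a sample list.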

(3) Filter and correct invalid k-mers

# @Prog:              K-mer Matrix Filter with Compression Support                                                                               
# @Version:           v3.0.0
#
# Usage: 
#             kmeria filter [options]
#
# Required options:
#             -i  STR     Directory of k-mer abundance matrices <input_dir>
#             -o  STR     Output directory <output_dir>
#             -d  STR     Sample depth file <sample_id\tsequence_depth[\tploidy]>
#
# Optional parameters:
#             -t  INT     Number of threads (default: 8)
#             -c  INT     Max abundance of k-mer (default: 1000)
#             -s  FLOAT   Missing ratio threshold (default: 0.8)
#             -p  INT     Genome ploidy (default: 4)
#
# Format control:
#             --input-format STR   Input format: auto|text|binary|compressed (default: auto)
#             --output-format STR  Output format: text|binary|compressed (default: text)
#
# Other options:
#             -v          Verbose output
#             -h          Show this help message
#
# Format Details:
#             auto        - Auto-detect input format (recommended)
#             text        - Plain text tab-delimited format
#             binary      - Uncompressed binary format
#             compressed  - Compressed binary format (30-90% space savings)
#
# Examples:
#             # Process compressed files, output compressed format (preserves compression)
#             kmeria filter -i matrices/ -o filtered/ -d depths.txt --output-format compressed -v
#
#             # Process any format, output text
#             kmeria filter -i matrices/ -o filtered/ -d depths.txt
#
#             # Convert compressed to text while filtering
#             kmeria filter -i matrices/ -o filtered/ -d depths.txt --input-format compressed
#
#             # Process with custom parameters
#             kmeria filter -i matrices/ -o filtered/ -d depths.txt -t 16 -c 500 -s 0.9 --output-format compressed
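
The sample depth file passed with -d is tab-delimited with two or three columns: sample_id, sequence_depth and an optional ploidy. Checking it before a long filtering run can save time. The following Python sketch is one way to do that; `parse_depth_file` is a hypothetical helper, and falling back to the -p default (4) when the ploidy column is missing is an assumption about how the two options interact, not documented behaviour.

```python
def parse_depth_file(lines, default_ploidy=4):
    """Parse 'sample_id<TAB>sequence_depth[<TAB>ploidy]' records.

    Returns {sample_id: (depth, ploidy)}. Hypothetical helper; the
    default ploidy fallback is an assumption for illustration.
    """
    records = {}
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split("\t")
        if len(fields) not in (2, 3):
            raise ValueError(f"line {lineno}: expected 2 or 3 tab-separated fields")
        sample = fields[0]
        depth = float(fields[1])
        ploidy = int(fields[2]) if len(fields) == 3 else default_ploidy
        if depth <= 0 or ploidy <= 0:
            raise ValueError(f"line {lineno}: depth and ploidy must be positive")
        if sample in records:
            raise ValueError(f"line {lineno}: duplicate sample '{sample}'")
        records[sample] = (depth, ploidy)
    return records


if __name__ == "__main__":
    with open("depths.txt") as handle:
        for sample, (depth, ploidy) in parse_depth_file(handle).items():
            print(sample, depth, ploidy)
```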

(4) Convert k-mer count matrices to BIMBAM dosage format

# @Prog:              K-mer Matrix to BIMBAM Converter
# @Version:           v3.0.0
#
# Usage: 
#             kmeria m2b [options]
#
# Required options:
#             --in  STR       Directory of k-mer matrices
#             --out STR       Output directory for BIMBAM files
#
# Optional parameters:
#             --threads INT       Number of file processing threads (default: 8)
#             --prefix STR        Input file prefix filter (default: "filtered_")
#
# Format control:
#             --input-format STR  Input format: auto|text|binary|compressed (default: auto)
#             --allele-info STR   Allele information string (default: "X, Y")
#
# Normalization options:
#             --no-normalize      Disable value normalization (default: enabled with linear)
#             --min-range FLOAT   Minimum output value (default: 0.0)
#             --max-range FLOAT   Maximum output value (default: 2.0)
#
#             --quantile-norm     Enable quantile-based normalization (handles outliers)
#                                 This method converts extreme values to boundaries:
#                                 - Values ≤ lower quantile → min-range (0.0)
#                                 - Values ≥ upper quantile → max-range (2.0)
#                                 - Values in between → linear transformation [0, 2]
# 
#             --lower-quantile FLOAT  Lower quantile threshold (default: 0.05, i.e., 5th percentile)
#                                      Values below this are set to min-range
# 
#             --upper-quantile FLOAT  Upper quantile threshold (default: 0.95, i.e., 95th percentile)
#                                      Values above this are set to max-range
#
# BGZF Compression options:
#             --no-compress       Disable compression (default: BGZF enabled)
#             --level INT         Compression level 0-9 (default: 6)
#                                 0=no compression, 1=fastest, 9=best compression
#             --bgzf-threads INT  BGZF compression threads per file (default: 4)
#             --buffer-size INT   Write buffer size in KB (default: 128)
#                                 Recommended: 64-256 KB for best performance
#
# Other options:
#             --verbose           Verbose output
#             --stats             Output detailed statistics
#             --help              Show this help message
#
# Examples:
#             # Basic conversion with linear normalization (default)
#             kmeria m2b --in filtered_matrices/ --out bimbam_output/
#
#             # Quantile-based normalization (robust to outliers)
#             kmeria m2b --in matrices/ --out bimbam/ --quantile-norm
#
#             # Custom quantile thresholds (1st and 99th percentiles)
#             kmeria m2b --in matrices/ --out bimbam/ --quantile-norm \
#                   --lower-quantile 0.01 --upper-quantile 0.99
#
#             # Quantile normalization with custom output range
#             kmeria m2b --in matrices/ --out bimbam/ --quantile-norm \
#                   --min-range -1.0 --max-range 1.0
#
#             # High-throughput with quantile normalization
#             kmeria m2b --in large_data/ --out output/ --quantile-norm \
#                   --threads 16 --bgzf-threads 8 --level 3 --verbose
#
#             # No normalization (preserve original values)
#             kmeria m2b --in matrices/ --out bimbam/ --no-normalize
#
#             # Maximum compression with quantile normalization
#             kmeria m2b --in matrices/ --out bimbam/ --quantile-norm --level 9
#
#             # No compression (maximum speed) with quantile normalization
#             kmeria m2b --in matrices/ --out bimbam/ --quantile-norm --no-compress
#
# Output Format:
#             - With BGZF (default): .bimbam.gz files (BGZF format)
#             - Without compression: .bimbam files (plain text)
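
The --quantile-norm description above can be made concrete with a short Python sketch: values at or below the lower quantile map to min-range, values at or above the upper quantile map to max-range, and values in between are rescaled linearly. This is an illustration of that description only, not the m2b source code. The function names are hypothetical, and two details are assumptions: quantiles are computed by linear interpolation between order statistics, and normalization is applied to one vector of values at a time (for example, one k-mer across samples).

```python
def quantile(sorted_vals, q):
    """Quantile by linear interpolation (assumed definition)."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * frac


def quantile_normalize(values, lower_q=0.05, upper_q=0.95,
                       min_range=0.0, max_range=2.0):
    """Clip to [lower quantile, upper quantile], then rescale linearly."""
    ordered = sorted(values)
    lo_v = quantile(ordered, lower_q)
    hi_v = quantile(ordered, upper_q)
    if hi_v == lo_v:
        # Assumption: a constant vector is mapped to min_range.
        return [min_range] * len(values)
    out = []
    for v in values:
        clipped = min(max(v, lo_v), hi_v)
        out.append(min_range + (clipped - lo_v) / (hi_v - lo_v)
                   * (max_range - min_range))
    return out


if __name__ == "__main__":
    counts = [0, 10, 20, 30, 1000]  # one extreme outlier
    print(quantile_normalize(counts))
```

Compared with plain min-max scaling, a single outlier such as 1000 no longer compresses all other values toward the lower end of the range.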

(5) Q + K calculation (population stratification and kinship)

# Randomly sample k-mers for PCA and kinship calculation (~0.1% of all k-mers, about 10,000,000).

for i in bimbam/*.bimbam.gz; do kmeria sketch -n 2000 "$i" >> sampling_kmer.bimbam; done

kmeria b2g -i sampling_kmer.bimbam -s <sample_list> -o sampling.geno

The b2g command is useful if you want to calculate PCA and kinship on the sampled k-mers with external software such as PLINK or GEMMA.

plink --vcf sampling.geno --make-bed --out sampling.geno
gemma -bfile sampling.geno -gk -p phenotype.tsv -o kinship
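
The sampling loop above draws a fixed number of k-mers from every BIMBAM file, so the total sample size depends on how many files there are. The Python sketch below does the arithmetic for choosing -n; it assumes that kmeria sketch -n N returns N k-mers per input file, which the loop suggests but which this tutorial does not state explicitly. The function names and the file count in the example are hypothetical.

```python
import math


def per_file_sample_size(target_total: int, n_files: int) -> int:
    """Value for -n so that n_files files yield at least target_total k-mers."""
    if target_total <= 0 or n_files <= 0:
        raise ValueError("target_total and n_files must be positive")
    return math.ceil(target_total / n_files)


def sampled_fraction(n_per_file: int, n_files: int, total_kmers: int) -> float:
    """Fraction of all k-mers that ends up in the sample."""
    return n_per_file * n_files / total_kmers


if __name__ == "__main__":
    n_files = 5000  # hypothetical number of BIMBAM files
    n = per_file_sample_size(10_000_000, n_files)
    print("use -n", n)
    # 10,000,000 sampled k-mers at ~0.1% implies about 1e10 k-mers in total.
    print("fraction:", sampled_fraction(n, n_files, 10_000_000_000))
```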

(6) k-mer-based association studies

# @Prog:              KmersGWAS
# @Version:           v2.1.0
#
# Usage: 
#             kmeria asso [options]
#
# Required options:
#             -i, --input DIR          Directory containing BIMBAM format files
#             -p, --pheno FILE         Phenotype file
#             -o, --output DIR         Output directory for results
#
# Tool selection:
#             --tool STR               Tool to use: bimbamAsso | gemma
#
# Optional files:
#             -c, --covar FILE         Covariate file
#             -k, --kinship FILE       Kinship/relatedness matrix
#             -s, --sample FILE        Sample file (for bimbamAsso tool)
#             --auto-sample            Auto-generate sample file from genotype data
#
# Analysis options:
#             -n, --ncol INT           Phenotype column number (default: 1)
#             -m, --method STR         Analysis method for GEMMA: lmm|bslmm|loco (default: lmm)
#             --maf FLOAT              Minor allele frequency threshold (default: 0.01)
#             --miss FLOAT             Missing data threshold (default: 0.05)
#             --no-kinship             Don't use kinship matrix even if provided
#
# Bimbam-specific options:
#             --bimbam-gzip            Input bimbam files are gzipped
#             --gen-kinship            Generate kinship matrix before association
#             --kin-method INT         Kinship method: 1=IBS+mean, 2=IBS+random, 3=BN (default: 3)
#             --kin-precision INT      Kinship precision digits (default: 10)
#             --out-precision INT      Output precision digits (default: 5)
#             --start-marker INT       Start marker index (default: 0)
#             --end-marker INT         End marker index (default: all)
#             --write-eigen            Write eigenvalue/eigenvector files
#             --disable-gls            Disable GLS, use OLS instead
#
# Performance options:
#             -t, --threads INT        Number of parallel threads (default: 8)
#             --dry-run                Show commands without executing
#
# Quality control:
#             --no-validate            Skip input file validation
#             --no-check-deps          Skip dependency checking
#
# Output options:
#             --verbose                Verbose output and progress reporting
#             --compress               Compress output files
#             --no-cleanup             Keep temporary files
#
# Other options:
#             -h, --help               Show this help message
#
# Examples:
#             
#             # Using kmeria association tools (bimbamAsso) with pre-computed kinship
#             kmeria asso --tool bimbamAsso -i bimbam_files/ -p pheno.txt \
#                       -s samples.txt -k kinship.txt --bimbam-gzip -t <num_threads> -o results/
#
#########################################################################################################################