mim : A small auxiliary index (and parser) to massively speed up parallel parsing of gzipped FASTQ/A files
TL;DR: mim is a small index (about 0.1% overhead) stored alongside .fastq.gz files that enables
multithreaded decompression, speeding up analysis pipelines.
# Build and install mim
cargo install mim-index
# Build an index file `records.fastq.gz.mim`
# Takes about as long as decompression on a single thread.
mim build records.fastq.gz
# Decompress the file into 8 `.fastq.{0..7}` files in parallel.
mim unzip records.fastq.gz --parts 8
# Decompress into 8 named pipes instead, that can be ingested once.
mim unzip records.fastq.gz --parts 8 --pipe
Disclaimer: while the .mim index format is stable for now, the CLI needs
further polishing. Feel free to report issues/suggestions.
Why mim? The project's name is a reference to the Norse figure Mímir, who is:
renowned for his knowledge and wisdom, who is beheaded during the Æsir–Vanir War. Afterward, the god Odin carries around Mímir's head and it recites secret knowledge and counsel to him.
Likewise, the mim index is a small index that provides critical knowledge of the internal structure of a gzipped FASTA/Q file, allowing rapid and efficient parallel parsing and decompression.
The purpose of mim is to let one create a mim index for gzipped FASTQ files that are expected to be processed more than once (e.g. by the original authors, or by another party after the files are deposited in a public database like ENA or SRA). Having the mim index available makes subsequent parsing of the data much faster, enabling more rapid re-analysis (e.g. when new versions of tools or even entirely different analysis algorithms become available).
The mim index is purely additive (i.e. creating it does not modify or rewrite any part of the original file), small (typically about 1/1000-th the size of the compressed input file), and takes about as much time to make as simply parsing the input. This makes it easy to create, store, transfer and share mim indexes.
Citation:
mim: A lightweight auxiliary index to enable fast, parallel, gzipped FASTQ parsing. Rob Patro, Siddhant Bharti, Prajwal Singhania, Rakrish Dhakal, Thomas J. Dahlstrom, Ragnar Groot Koerkamp. https://doi.org/10.1101/2025.11.24.690271
Install the mim binary using cargo install mim-index.
mim requires Rust 1.91 or newer (MSRV). If your Rust version is older than this, please upgrade by running rustup update.
The examples below assume an input file records.fastq.gz. .fasta.gz files
are also supported.
- build: build the .mim index file from a .fastx.gz.
- unzip: use the .mim index for parallel decompression into parts or pipes.
- server: run a server that provides content-addressed storage of .mim files, keyed by hashes of .gz files.
- upload: upload a .mim to a server.
- download: download the .mim for a .gz from the server.
mim --help
Simple program to deal with mim files
Usage: mim <COMMAND>
Commands:
build Build the .mim index
unzip Parallel-unzip a .fastx.gz using the .mim index
info Print .mim file metadata
peek print some reads
nuc-hist print some reads
server Run the server
upload Upload a mim file
download Download a file
help Print this message or the help of the given subcommand(s)
Build a .fastq.gz.mim index file for a given .fastq.gz file.
# Write records.fastq.gz.mim:
mim build records.fastq.gz
# Write in custom location:
mim build records.fastq.gz -m /path/to/records.fastq.gz.mim
# Use a chunk-size of 1GB instead of the default 32MB
mim build records.fastq.gz --chunk-size 1000000000
# Add custom json-encoded metadata.
mim build records.fastq.gz --metadata <METADATA>
mim build --help
Build the .mim index
Usage: mim build [OPTIONS] <FASTX_GZ>
Arguments:
<FASTX_GZ> Input .fastx.gz file
Options:
-m, --mim <INDEX_PATH> .mim file to write; default <FASTX_GZ>.mim
-c, --chunk-size <CHUNK_SIZE> Distance between checkpoints [default: 32000000]
-d, --metadata <METADATA> Optional metadata to add. Json-encoded string
Given a records.fastq.gz and its records.fastq.gz.mim, unzip using multiple threads into either:
- the plain records.fastq: mim unzip records.fastq.gz
- multiple records.fastq.<ID> file parts: mim unzip records.fastq.gz --parts 8
- multiple records.fastq.<ID> named pipes: mim unzip records.fastq.gz --parts 8 --pipe
  This will block until all pipes have been read to completion by some other program.
mim unzip --help
Parallel-unzip a .fastx.gz using the .mim index
Usage: mim unzip [OPTIONS] <FASTX_GZ>
Arguments:
<FASTX_GZ> Input .fastx.gz file
Options:
-m, --mim <INDEX_PATH> .mim file to use; default <FASTX_GZ>.mim
-o, --output <OUTPUT> Output path. .fastx or .fastx.<part_id> by default
-p, --parts <PARTS> The number of .fastx.<part_id> parts to write
-j, --threads <THREADS> Number of threads to use. Defaults to number of cores
--pipe Fork and create a named pipe instead of file for each part. Requires --parts
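As an illustration of how a downstream program might consume the parts concurrently, here is a minimal sketch that opens each part in its own thread and counts records. It uses only the standard library. The function names are hypothetical, and it assumes plain 4-lines-per-record FASTQ; it is not part of mim itself. Opening a named pipe with `File::open` works the same way as opening a regular file, so the same code would apply to `--pipe` output.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::{Path, PathBuf};
use std::thread;

/// Count FASTQ records in one part.
/// Assumption: every record spans exactly 4 lines.
fn count_records(path: &Path) -> io::Result<u64> {
    let reader = BufReader::new(File::open(path)?);
    let mut lines = 0u64;
    for line in reader.split(b'\n') {
        line?; // propagate read errors
        lines += 1;
    }
    Ok(lines / 4)
}

/// Read all parts concurrently, one thread per part, and sum the counts.
fn count_records_in_parts(paths: &[PathBuf]) -> io::Result<u64> {
    let results: Vec<io::Result<u64>> = thread::scope(|s| {
        // Spawn every reader first so that all parts are drained in parallel.
        let handles: Vec<_> = paths
            .iter()
            .map(|p| s.spawn(move || count_records(p)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    });
    // Summing `Result`s stops at the first error, if any.
    results.into_iter().sum()
}
```

Spawning all readers before joining any of them matters for pipes: since `mim unzip --pipe` blocks until every pipe has been read to completion, reading the parts one after another could stall.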
The .mim index stores a Blake3 hash of the corresponding .gz file, which is
verified before decompression. This hash also serves as a global identifier for
the .gz file. With mim server, one can launch a long-running process that reads all
.mim files in a local directory and serves them over a Unix socket, keyed by
the hash.
Clients can then use mim upload records.fastq.gz.mim to upload the .mim for a local .gz file to
the server, so that others can later use mim download records.fastq.gz to
download the .mim based on the hash of records.fastq.gz.
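The content-addressed idea can be sketched in a few lines. The example below is only a conceptual model, not mim's server code: `MimIndex`, `IndexStore`, and `content_key` are hypothetical names, the store lives in memory rather than behind a Unix socket, and the standard library's `DefaultHasher` stands in for Blake3 so that the sketch has no external dependencies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in content hash. mim uses Blake3; `DefaultHasher` is NOT a
/// cryptographic hash and is used here only to keep the sketch std-only.
fn content_key(gz_bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    gz_bytes.hash(&mut hasher);
    hasher.finish()
}

/// Hypothetical model of an index: it records the hash of the .gz file
/// it was built from, plus an opaque payload.
struct MimIndex {
    gz_hash: u64,
    payload: Vec<u8>,
}

impl MimIndex {
    /// Check that this index belongs to the given .gz content.
    fn matches(&self, gz_bytes: &[u8]) -> bool {
        self.gz_hash == content_key(gz_bytes)
    }
}

/// Hypothetical in-memory store, keyed by the hash of the .gz file.
#[derive(Default)]
struct IndexStore {
    by_hash: HashMap<u64, MimIndex>,
}

impl IndexStore {
    /// "Upload": the key comes from the hash stored inside the index.
    fn upload(&mut self, index: MimIndex) {
        self.by_hash.insert(index.gz_hash, index);
    }

    /// "Download": the client only needs the .gz content to find its index.
    fn download(&self, gz_bytes: &[u8]) -> Option<&MimIndex> {
        self.by_hash.get(&content_key(gz_bytes))
    }
}
```

Because the key is derived from the file content rather than its name, a renamed or re-downloaded copy of the same .gz file still resolves to the same index.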
mim server needs two arguments: the path where it will create a Unix socket to listen
for incoming requests, and the directory where (previously) uploaded .mim files
are hosted.
mim server --socket /tmp/mim-server --dir /mnt/data/mim-files
mim server --help
Run the server
Usage: mim server --socket <SOCKET> --dir <DIR>
Options:
--socket <SOCKET> Path to unix socket to listen on
--dir <DIR> The directory containing the .mim files
To upload a local .mim file, use:
mim upload --socket /tmp/mim-server records.fastq.gz.mim
mim upload --help
Upload a mim file
Usage: mim upload --socket <SOCKET> <MIM>
Arguments:
<MIM> The .mim file to upload
Options:
--socket <SOCKET> Path to unix socket to connect to
To download the .mim file for a local .gz file, run:
mim download --socket /tmp/mim-server records.fastq.gz
Write it to a custom location using -m custom/records.fastq.gz.mim.
mim download --help
Download a file
Usage: mim download [OPTIONS] --socket <SOCKET> <FASTX_GZ>
Arguments:
<FASTX_GZ> The .fastx.gz file to hash and download the .mim for
Options:
-m, --mim <INDEX_PATH> Output location of the .mim. Default <FASTX_GZ>.mim
--socket <SOCKET> Path to unix socket to connect to
Given an existing .mim file, multi-threaded file parsing works by instantiating
a MimReader with the number of workers, and then calling .readers() to
return a reader for each thread. Each reader returns a record-aligned range of bytes.
use std::path::PathBuf;
use mim::MimReader;
let num_workers = 8;
// Requires `records.fastq.gz.mim` to exist.
let reader: MimReader = mim::mim_reader(PathBuf::from("records.fastq.gz"), num_workers);
std::thread::scope(|s| {
    for reader in reader.readers() {
        let reader = reader.unwrap();
        // `move` hands ownership of this worker's reader to its thread.
        s.spawn(move || {
            // Do something with the reader, e.g. read records from it.
        });
    }
});
We also provide MimReader::get_needletail_parser and
MimReader::get_needletail_iter that directly wrap the reader in a Needletail
parser and iterator over records.
See <src/lib.rs> or docs.rs for details.
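To make "record-aligned" concrete, the following sketch splits an already-decompressed FASTQ buffer into contiguous byte ranges that each begin at a record start. It is a simplified illustration with hypothetical function names, assuming plain 4-lines-per-record FASTQ with `\n` line endings. It does not show how mim locates such boundaries inside the gzip stream using the index.

```rust
use std::ops::Range;

/// Byte offsets at which each FASTQ record starts.
/// Counting lines (rather than searching for '@') avoids being fooled by
/// quality strings that happen to begin with '@'.
fn record_starts(fastq: &[u8]) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut line_no = 0usize;
    let mut pos = 0usize;
    while pos < fastq.len() {
        if line_no % 4 == 0 {
            starts.push(pos);
        }
        // Advance to the byte after the next newline (or to the end).
        pos = match fastq[pos..].iter().position(|&b| b == b'\n') {
            Some(i) => pos + i + 1,
            None => fastq.len(),
        };
        line_no += 1;
    }
    starts
}

/// Split the buffer into at most `workers` contiguous ranges, each starting
/// at a record boundary, so that no record straddles two ranges.
fn record_aligned_ranges(fastq: &[u8], workers: usize) -> Vec<Range<usize>> {
    let starts = record_starts(fastq);
    if starts.is_empty() || workers == 0 {
        return Vec::new();
    }
    // Records per worker, rounded up.
    let per_worker = (starts.len() + workers - 1) / workers;
    // First record offset of each group, followed by the end of the buffer.
    let mut bounds: Vec<usize> = starts.chunks(per_worker).map(|c| c[0]).collect();
    bounds.push(fastq.len());
    bounds.windows(2).map(|w| w[0]..w[1]).collect()
}
```

Each worker can then parse `&fastq[range]` independently, since every range contains only whole records.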
The original approach, which has been altered substantially, was implemented for a final project in the Spring 2025 edition of CMSC701 at UMD. The original implementation, from which this project eventually evolved, is available here.
