BioDemuX.jl is a Julia package designed for demultiplexing reads based on barcodes. Given a set of sequencing reads in FASTQ format and a reference barcode in CSV or TSV format, each read is assigned to a FASTQ file corresponding to its barcode. This barcode assignment process is designed to be robust to barcode mutations and allows you to adjust the permitted level of mutation through parameters.
- Fast and accurate semi-global alignment
- Robust to barcode mutations
- No restrictions on barcode size or position
- Usable as julia package
- Supports parallel computing
To be published.
- Dependencies
- Installation
- Basic Usage
- Tips to Speed Up Demultiplexing
- Options
- Example: How Barcode Length and Option Values Affect Classification
- How to run tests
- Dataset
- Support or Contact
- Julia >= 1.10 (includes the
Distributedstandard library) - Julia packages
- DataFrames >= 1.7.0
- CSV >= 0.10.14
- CodecZlib >= 0.7.0
- BufferedStreams >= 1.2.2
- Open the Julia REPL (by typing
juliain the terminal). - Press
]to enter the Pkg REPL mode. - Run the following command to install
BioDemuX.jl:add BioDemuX
- Press the
Backspacekey to exit the Pkg mode.
The primary function of this package is execute_demultiplexing(). It classifies sequences in a FASTQ file by aligning them with reference barcodes in barcode file. Usage is as follows:
using BioDemuX
execute_demultiplexing(FASTQ_file, barcode_file, output_directory)- There is no restriction on the sequence length in the FASTQ file.
- The input can be gzipped; the function automatically detects and processes the files accordingly.
- The function can take one or two FASTQ files as input. In the case of using two FASTQ files, the command can be executed as follows:
execute_demultiplexing(FASTQ_file1, FASTQ_file2, barcode_file, output_directory)When using two FASTQ files, sequences in the FASTQ_file2 are classified based on the alignment of the FASTQ_file1 sequences with the barcodes in the barcode reference file. Hence, the corresponding reads in both FASTQ files must be in the same order and present in equal numbers.
- The reference file is expected to be a CSV or TSV file containing the following columns:
ID,Full_seq,Full_annotation, as shown below:
ID Full_seq Full_annotation
001-barcode ACAGACUACAAA XXXBBBBBBBXX
-
In the
Full_seqcolumn, the region specified asBin theFull_annotationcolumn is considered as the barcode. -
Alternatively, a FASTA file of barcode sequences can be used as the reference. In this case, each sequence in the FASTA file is treated as a full barcode (the entire sequence is considered the barcode region) and the header line of each entry (without the
>prefix) is used as itsID.
- All output files will be saved in the specified
output_directory. - The output is gzipped depending on the input FASTQ format, or can be specified using the
gzip_outputoption. - The names of the output files are based on the filename of the FASTQ file as the prefix and the
IDvalues in the barcode reference file. For example, if the FASTQ filename issample.fastqand the reference file contains IDs such as001and002, the resulting output files will be namedsample.001.fastq,sample.002.fastq, and so on. You can freely change the prefix by specifying theoutput_prefixargument. - Sequences that do not match any barcode in the reference file are saved in
unknown.fastq. Sequences that have ambiguous classification (i.e., they match multiple barcodes with similar scores) are saved inambiguous_classification.fastq. These FASTQ files also have prefix likesample.unknown.fastqandsample.ambiguous_classification.fastq. - If the
output_directorydoes not exist, a new directory is created to store the output files.
BioDemuX.jl supports parallel computing using multi-threading and channel-based streaming IO, allowing faster processing of large datasets with efficient memory usage. To utilize parallel processing, follow the steps below:
To enable parallel processing, you need to start Julia with multiple threads. Use the -t (or --threads) flag followed by the number of desired threads:
./julia -t [number_of_threads]For example, to use 8 threads:
./julia -t 8You can use the -t auto option to automatically detect the number of available threads and start Julia with that number of threads.
You can also set the JULIA_NUM_THREADS environment variable.
export JULIA_NUM_THREADS=8Once Julia is started with multiple threads, BioDemuX.jl automatically utilizes them for reading, writing, and processing data in parallel. No additional setup is required.
BioDemuX.jl skips calculation of unnecessary path in DP matrix based on the settings of max_error_rate,mismatch. and indel. By setting lower max error rate or higher penalty, you can further increase computation speed.
The execute_demultiplexing function provides several optional parameters to control the demultiplexing process:
execute_demultiplexing(FASTQ_file, barcode_file, output_directory; barcode_file2=nothing, output_prefix="", gzip_output=nothing, max_error_rate=0.2, min_delta=0.0, mismatch=1, indel=1, nindel=nothing, bc_complement=false, bc_rev=false, ref_search_range="1:end", barcode_start_range="1:end", barcode_end_range="1:end", ref_search_range2="1:end", barcode_start_range2="1:end", barcode_end_range2="1:end", chunk_size=4000, channel_capacity=64, matching_algorithm=:semiglobal, log=false)-
max_error_rate::Float64(default:0.2):- This is the maximum allowed error rate for matching sequences to barcodes. It is multiplied by the barcode's length to calculate the total penalty score that can be tolerated. If the sequence's alignment penalty exceeds this limit for all barcodes, it will be saved in
unknown.fastq.
- This is the maximum allowed error rate for matching sequences to barcodes. It is multiplied by the barcode's length to calculate the total penalty score that can be tolerated. If the sequence's alignment penalty exceeds this limit for all barcodes, it will be saved in
-
matching_algorithm::Symbol(default::semiglobal):- Specifies the algorithm used for barcode matching.
:semiglobal: Uses semi-global alignment allowing for mismatches, insertions, and deletions. This is the default and most robust mode.:hamming: Uses Hamming distance, allowing only for mismatches (no insertions or deletions). Faster than semiglobal.:exact: Uses exact string matching, allowing no mismatches or indels. Fastest mode. Use only when data quality is perfect or strict filtering is desired.
- Specifies the algorithm used for barcode matching.
-
min_delta::Float64(default:0.0):- This defines the minimum difference in penalty scores needed to confidently assign a sequence to a barcode. It is multiplied by the barcode's length to determine the score difference required to avoid ambiguity. If the difference between the best match's penalty score and the second-best match's score is less than this threshold, the sequence is considered ambiguous and saved in
ambiguous_classification.fastq.
- This defines the minimum difference in penalty scores needed to confidently assign a sequence to a barcode. It is multiplied by the barcode's length to determine the score difference required to avoid ambiguity. If the difference between the best match's penalty score and the second-best match's score is less than this threshold, the sequence is considered ambiguous and saved in
-
mismatch::Int(default:1):- The penalty score for mismatches during sequence alignment. A higher value makes the alignment more strict for mismatches.
-
indel::Int(default:1):- The penalty score for insertions and deletions (indels) during sequence alignment. A higher value makes the alignment more strict for insertions or deletions.
-
nindel::Union{Int, Nothing}(default:nothing):- The penalty score for insertions and deletions (indels) when the base is 'N' (wildcard). This should only be specified when using 'N' in barcodes or reads. Note that enabling this option may slow down the demultiplexing process.
-
classify_both::Bool(default:false):- If set to
true, the function will classify sequences in bothFASTQ_file1andFASTQ_file2based on the alignment with sequences inFASTQ_file1and output separate files for each. Otherwise, it classifies only R2 sequences by default.
- If set to
-
bc_complement::Bool(default:false):- If set to true, the barcodes in the reference file are converted to their complementary sequences before alignment.
-
bc_rev::Bool(default:false):- If set to true, the barcodes in the reference file are reversed before alignment.
-
output_prefix1::String(default:""):- Specifies the prefix for the first set of output files when processing two FASTQ files. If not provided, the prefix defaults to the name of the
FASTQ_file1. By setting this option, you can customize the file names for the first set of outputs.
- Specifies the prefix for the first set of output files when processing two FASTQ files. If not provided, the prefix defaults to the name of the
-
output_prefix2::String(default:""):- Specifies the prefix for the second set of output files when processing two FASTQ files. If not provided, the prefix defaults to the name of the
FASTQ_file2. By setting this option, you can customize the file names for the second set of outputs.
- Specifies the prefix for the second set of output files when processing two FASTQ files. If not provided, the prefix defaults to the name of the
-
output_prefix::String(default:""):- Specifies the prefix for the output files when processing a single FASTQ file. If not provided, the prefix defaults to the name of the FASTQ file.
-
gzip_output::Bool(default:auto-detect):- Controls whether the output FASTQ files are compressed (gzipped). By default (when this option is not explicitly set), the output will be gzipped if the input FASTQ files have a
.gzextension, and uncompressed otherwise. You can setgzip_output=trueto force gzipped output files orgzip_output=falseto ensure output files are not compressed.
- Controls whether the output FASTQ files are compressed (gzipped). By default (when this option is not explicitly set), the output will be gzipped if the input FASTQ files have a
-
ref_search_range::String(default:"1:end"):- Specifies the range within the read sequence where the barcode search should be performed. The format is
"start:end", wherestartandendcan be integers (1-based index) or relative to the end of the sequence usingend(e.g.,"1:20","end-19:end"). This allows you to restrict the search to a specific region, improving performance and accuracy if the barcode position is known.
- Specifies the range within the read sequence where the barcode search should be performed. The format is
-
barcode_start_range::String(default:"1:end"):- Specifies the allowed range for the start position of the barcode alignment within the read. The format is the same as
ref_search_range. If the aligned barcode starts outside this range, it will not be considered a valid match.
- Specifies the allowed range for the start position of the barcode alignment within the read. The format is the same as
-
barcode_end_range::String(default:"1:end"):- Specifies the allowed range for the end position of the barcode alignment within the read. The format is the same as
ref_search_range. If the aligned barcode ends outside this range, it will not be considered a valid match.
- Specifies the allowed range for the end position of the barcode alignment within the read. The format is the same as
-
chunk_size::Int(default:4000):- Specifies the number of reads to process in a single chunk. Larger chunk sizes can reduce overhead but increase memory usage.
-
channel_capacity::Int(default:64):- Specifies the capacity of the input/output channels. The recycle channel capacity is automatically set to
2 * channel_capacity. Increasing this value can improve throughput on systems with high memory availability.
- Specifies the capacity of the input/output channels. The recycle channel capacity is automatically set to
-
barcode_file2::Union{String, Nothing}(default:nothing):- Specifies the path to the second barcode reference file for dual-index demultiplexing. If provided, the function performs a two-step alignment: first against
barcode_file, then againstbarcode_file2.
- Specifies the path to the second barcode reference file for dual-index demultiplexing. If provided, the function performs a two-step alignment: first against
-
ref_search_range2::String(default:"1:end"):- Specifies the search range for the second barcode. Format is the same as
ref_search_range.
- Specifies the search range for the second barcode. Format is the same as
-
barcode_start_range2::String(default:"1:end"):- Specifies the allowed start range for the second barcode. Format is the same as
barcode_start_range.
- Specifies the allowed start range for the second barcode. Format is the same as
-
barcode_end_range2::String(default:"1:end"):- Specifies the allowed end range for the second barcode. Format is the same as
barcode_end_range.
- Specifies the allowed end range for the second barcode. Format is the same as
-
summary::Bool(default:false):- If set to
true, a summary report is generated after demultiplexing. The report includes statistics on total reads, matched reads, unmatched reads, ambiguous reads, and detailed distributions of alignment scores, positions, and lengths for each barcode.
- If set to
-
summary_format::Symbol(default::html):- Specifies the format of the summary report. Options are:
:text: A human-readable text file (summary.txt).:html: An interactive HTML report (summary.html) with tabs and dropdowns for detailed visualization.:json: A JSON file (summary.json) for programmatic parsing.:stdout: Prints the summary to the standard output.
- Specifies the format of the summary report. Options are:
-
trim_side::Union{Int, Nothing}(default:nothing):- Specifies which side of the read to trim relative to the barcode.
3: Trims the 3' side (right side) of the barcode match. Keeps the sequence before the barcode.5: Trims the 5' side (left side) of the barcode match. Keeps the sequence after the barcode.
- If
nothing, no trimming is performed.
- Specifies which side of the read to trim relative to the barcode.
-
trim_side2::Union{Int, Nothing}(default:nothing):- Specifies trimming for the second barcode in dual-barcode mode. Options are the same as
trim_side.
- Specifies trimming for the second barcode in dual-barcode mode. Options are the same as
-
log::Bool(default:false):- If set to
true, basic logging information (start configuration and end duration) is printed to the standard error (stderr).
- If set to
We assume the case where barcode length is 10, max_error_rate is 0.2, min_delta is 0.2, mismatch is 1, indel is 2.
-
Maximum Allowed Penalty Score:
- With a
max_error_rateof 0.2 and a barcode length of 10, the maximum allowed penalty score for a sequence to still match a barcode is0.2 * 10 = 2.
- With a
-
Minimum Allowed Penalty Difference:
- With
min_delta = 0.2and a barcode length of 10, the minimum required difference in scores between the best and second-best barcode matches is0.2 * 10 = 2.
- With
-
Penalty Settings:
mismatch = 1: Each mismatch in the sequence alignment contributes a penalty of 1.indel = 2: Each insertion or deletion (indel) contributes a penalty of 2, making indels more costly than mismatches.
With these settings, the classification works as follows:
-
Allowed Error:
- Since the maximum allowed penalty score is 2 (
0.2 * 10):- The sequence can have up to 2 mismatches (since each mismatch has a penalty of 1).
- The sequence can have up to 1 indel (since each indel has a penalty of 2).
- For example, when the sequence have a combination of 1 mismatch and 1 indel, the penalty score is
1 (mismatch) + 2 (indel) = 3. Since this score exceeds the maximum allowed penalty score of 2, it would not be allowed.
- Since the maximum allowed penalty score is 2 (
-
Matching Process:
- During the alignment, the sequence is compared to each barcode in the reference file. The total penalty score (based on mismatches and indels) is calculated for each alignment.
- If a sequence's penalty score with a barcode exceeds 2, it cannot be classified under that barcode.
-
Ambiguous Classification:
- If a sequence matches multiple barcodes and the penalty scores of the best match and the second-best match differ by less than 2 (the minimum allowed penalty difference), the sequence will be classified into
ambiguous_classification.fastq. - For example, if the best matching barcode has a penalty score of 1 and the second-best has a score of 2, the difference is
2 - 1 = 1, which is less than the threshold of 2. Therefore, the sequence is ambiguous.
- If a sequence matches multiple barcodes and the penalty scores of the best match and the second-best match differ by less than 2 (the minimum allowed penalty difference), the sequence will be classified into
-
Unknown Classification:
- If the sequence fails to match any barcode within the maximum allowed penalty score of 2, it is classified as
unknown.fastq.
- If the sequence fails to match any barcode within the maximum allowed penalty score of 2, it is classified as
To ensure that BioDemuX.jl functions correctly, you can run tests using the Julia package's built-in testing functionality.
using Pkg
Pkg.test("BioDemuX")The datasets used in the paper are publicly available.
- In vitro dataset and test code:
Link to the dataset and test code: https://doi.org/10.24433/CO.5417078.v1
- In silico dataset:
Link to the dataset: https://doi.org/10.5281/zenodo.14178228
If you encounter any issues or have requests, please provide feedback by posting a new GitHub issue on our repository. We appreciate your input and will do our best to assist you!

