Description
Hello,
I noticed that specifying the linker sequence instead of linker_length in the config results in fewer barcode assignments in the output.
After some testing, I think this happens because when the linker sequence is provided, the pipeline uses Cutadapt to extract barcodes rather than trimming by fixed length. This produces a barcode file where barcodes can vary in size, and any barcode that doesn’t match the bc_length specified in the config file gets filtered out later in the pipeline.
If the barcode is always the first "N" bases, it seems better to specify only the linker_length instead of the linker sequence in the assignment config. That way, the pipeline extracts barcodes purely by length, avoiding Cutadapt’s variable trimming behavior.
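To illustrate what I think is going on, here is a toy sketch (not the pipeline's code). `BC_LENGTH`, `LINKER`, and the reads are made-up values, and a plain `str.find()` stands in for Cutadapt's error-tolerant adapter matching, so this only shows the general effect of a length filter applied after variable-length extraction.

```python
# Toy illustration only. All values below are hypothetical, and str.find()
# is a simplified stand-in for Cutadapt's adapter alignment.
BC_LENGTH = 15          # stands in for bc_length from the config
LINKER = "TCTAGAGG"     # made-up linker sequence


def extract_by_length(read, bc_length):
    """Position-based extraction: the barcode is the first bc_length bases."""
    return read[:bc_length]


def extract_by_linker(read, linker):
    """Adapter-based extraction: the barcode is everything before the linker."""
    idx = read.find(linker)
    return read[:idx] if idx != -1 else None


def passes_length_filter(barcode, bc_length):
    """Downstream filter: only barcodes of exactly bc_length are kept."""
    return barcode is not None and len(barcode) == bc_length


reads = [
    "ACGTACGTACGTACG" + LINKER,        # linker starts at position 15, as expected
    "ACGTACGTACGTAC" + LINKER + "A",   # linker starts one base early (14 nt before it)
    "ACGTACGTACGTACGT" + LINKER,       # linker starts one base late (16 nt before it)
]

kept_by_length = [
    bc for bc in (extract_by_length(r, BC_LENGTH) for r in reads)
    if passes_length_filter(bc, BC_LENGTH)
]
kept_by_linker = [
    bc for bc in (extract_by_linker(r, LINKER) for r in reads)
    if passes_length_filter(bc, BC_LENGTH)
]

print(len(kept_by_length), "barcodes kept with length-based extraction")
print(len(kept_by_linker), "barcodes kept with linker-based extraction")
```

In this toy case, length-based extraction keeps all three reads, while linker-based extraction keeps only the first, because the other two yield 14 nt and 16 nt barcodes that fail the length filter. One caveat: for the shifted reads, the 15 bases taken by position do not line up exactly with the bases before the linker, so fixed-length extraction is only appropriate when the barcode really is at a fixed position.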
Example:
On a small test set (40,000 reads):
- When specifying the linker sequence, I got 33,994 assignments.
- When specifying only the linker length, I got 34,564 assignments.
Suggestion:
I wonder if it might be helpful to clarify in the documentation that providing a linker sequence triggers Cutadapt-based barcode extraction, which can sometimes result in fewer assignments. Specifying only the linker length uses strict position-based extraction and may be more appropriate when the barcode is at a fixed position.
Thanks so much for all the work on this pipeline!
Grace