
Conversation

@Trecek Trecek commented Apr 21, 2025

This is a PR to test the fastxpp format. It is not for merging in, just for reviewing the approach and path forward.

File changes

generate_fastxpp.mojo

This converts FASTA format to fastxpp format. I have not verified that it works for FASTQ yet;
I am only working with FASTA for now.
I just used the first approach that worked. I did not optimize or benchmark it yet, as the exact format may change.

kseq.mojo

The original fastx read method was kept to make benchmarking easier.
read_fastxpp was added as a method on the fastx reader struct.

There is an optimization that removes the newline characters in place. I haven't tested how big an impact it makes; I just wanted to try it because it sounded cool.
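Roughly, the idea is the following (just a sketch of the technique, not the exact code in kseq.mojo; the function name and raw-pointer signature are illustrative):

from memory import UnsafePointer


# Sketch only: drop newline bytes from a sequence buffer in place.
# Walks the buffer once with separate read/write cursors and returns the
# new logical length; bytes past that point are stale.
fn compact_newlines(buf: UnsafePointer[UInt8], n: Int) -> Int:
    var write = 0
    for read in range(n):
        var b = buf[read]
        if b != 0x0A:  # 0x0A == "\n"
            buf[write] = b
            write += 1
    return write

The compaction means each sequence byte gets touched twice (once by the read, once here), which is what the swar and read_once variants described below try to avoid.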

fastxpp_bench.mojo

Right now I am just using hyperfine to benchmark with this helper file. It takes the read method to use as a parameter, so I can invoke hyperfine like this:

hyperfine --warmup 5 -r 50 \
    'mojo run fastxpp_bench.mojo /analysis/git_projects/ish_fastx/data/uniprot_sprot.fastxpp orig' \
    'mojo run fastxpp_bench.mojo /analysis/git_projects/ish_fastx/data/uniprot_sprot.fastxpp fastxpp'

Benchmarking results

Used hyperfine --warmup 3 -r 20

orig

  • Original read method that uses no header comment info

strip_newline

  • Uses header info; does not use bpl, only slen and lcnt
  • SWAR decode
  • In-place compaction to remove newlines

swar

  • Uses header info: lcnt, bpl, slen
  • SWAR decode
  • Uses bytes per line (bpl) to remove newlines, doing a memcpy per line and jumping over the newline bytes

read_once

  • Uses header info, but does not need lcnt; only uses slen and bpl.
  • SWAR decode
  • Uses bpl to read bytes up to each newline, then consumes the newline with read_byte.

If/else dispatch order for this run: orig | strip_newline | fastxpp_swar | read_once
(benchmark screenshot)

Benchmark 1: ./fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x orig
  Time (mean ± σ):      1.560 s ±  0.011 s    [User: 1.246 s, System: 0.293 s]
  Range (min … max):    1.548 s …  1.594 s    20 runs
 
Benchmark 2: ./fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x strip_newline
  Time (mean ± σ):      1.303 s ±  0.012 s    [User: 0.995 s, System: 0.288 s]
  Range (min … max):    1.287 s …  1.337 s    20 runs
 
Benchmark 3: ./fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x swar
  Time (mean ± σ):      1.335 s ±  0.012 s    [User: 1.015 s, System: 0.299 s]
  Range (min … max):    1.317 s …  1.354 s    20 runs
 
Benchmark 4: ./fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x read_once
  Time (mean ± σ):      1.267 s ±  0.009 s    [User: 0.942 s, System: 0.304 s]
  Range (min … max):    1.251 s …  1.286 s    20 runs
 
Summary
  ./fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x read_once ran
    1.03 ± 0.01 times faster than ./fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x strip_newline
    1.05 ± 0.01 times faster than ./fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x swar
    1.23 ± 0.01 times faster than ./fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x orig

RESULTS CHANGE WITH IF/ELSE ORDER

Order of if/else in this test: orig | strip_newline | read_once | fastxpp_swar
(benchmark screenshot)

read_once method, maybe

It just narrowly edges out strip_newline.
This method does not use lcnt, which is great because that field is 7 digits.
Using this method, we can reduce the comment field size.

Benchmarking notes

The input FASTA was concatenated (cat) together 10x, bringing it from an uncompressed size of 256 MB to 2.6 GB, more in line with real-world data sizes (still small).

I benchmarked on an EC2 Ubuntu instance with 1000 MB/s throughput. That is important because our default EC2 throughput setting is 125 MB/s, which is far slower than the slowest SSD on the market.

Previously I was benchmarking with 'mojo run', but this led to differences due to compile time. Now the binary is built beforehand, which changed some of the bench results.

SWAR decode had a big impact on time, so it was included in each fastxpp variant read method.

hyperfine --warmup 2 -r 10 \
    './fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x orig' \
    './fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x strip_newline' \
    './fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x swar' \
    './fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x read_once'

Optimizations & ideas:

1: Better ASCII number parsing (SWAR decode)

I think ASCII number parsing is relatively expensive. I'm not certain, but there is a better ASCII parsing method worth trying:
fixed-width header info + SWAR. It doesn't seem like it would be difficult to implement SWAR (SIMD within a register): https://lemire.me/blog/2022/01/21/swar-explained-parsing-eight-digits/
The same optimization was used in https://lemire.me/blog/2021/01/29/number-parsing-at-a-gigabyte-per-second/
Right now I parse the ASCII the long way mentioned in the article.
The SWAR method, combined with fixed-size fields like 0000 0000 0000, would avoid having to handle separators.
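For reference, the eight-digit SWAR parse from the post would look roughly like this in Mojo (sketch only; the name, pointer-based signature, and eight-digit field width are my assumptions, not code in this PR):

from memory import UnsafePointer


# Sketch only: parse exactly eight ASCII digits starting at p, SWAR style.
fn parse_eight_digits_swar(p: UnsafePointer[UInt8]) -> UInt64:
    # Pack the eight bytes into one little-endian 64-bit word.
    var v: UInt64 = 0
    for i in range(8):
        v |= p[i].cast[DType.uint64]() << UInt64(8 * i)
    # Turn ASCII codes into digit values in every byte lane.
    v -= 0x3030303030303030
    # Merge adjacent lanes: 2-digit, then 4-digit, then 8-digit values.
    var pairs = ((v * ((10 << 8) + 1)) >> 8) & 0x00FF00FF00FF00FF
    var quads = ((pairs * ((100 << 16) + 1)) >> 16) & 0x0000FFFF0000FFFF
    return ((quads * ((10000 << 32) + 1)) >> 32)

A real implementation would replace the packing loop with a single unaligned 8-byte load. With every field padded to a fixed eight digits there is nothing to scan for: each header number becomes one unconditional call at a known offset.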

2: Better record header parsing

We should only need to parse for the >/@ header marker once. Since we already did a single pass over the file (to generate the fastxpp header info), we can either impute the offset of the next record start or store it in the previous record's header info, which avoids checks for empty lines and record boundaries.
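Rough sketch of the offset arithmetic (FASTA case only; header_end and the function name are made up for illustration):

# Sketch only: compute where the next ">" should start instead of scanning.
# header_end = offset just past the newline ending this record's header line.
# slen = sequence bytes, lcnt = number of sequence lines (one newline each).
fn next_record_offset(header_end: Int, slen: Int, lcnt: Int) -> Int:
    return header_end + slen + lcnt

For FASTQ the same idea would also have to account for the "+" line and the quality bytes.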

3: Checks up front

Since we are already parsing the fastx file to generate the header size info, we could do all the QC checks up front and include a flag field recording which checks were performed. This can save a lot of time on checks like validating the + line or that quality length == sequence length.
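A rough sketch of what the flag field could look like (all names and bit assignments here are made up): the converter sets a bit for each check it already performed, and the reader just tests bits instead of re-validating.

# Sketch only: hypothetical QC flag bits, stored as a small integer field.
alias QC_PLUS_LINE_OK = 1      # "+" separator line present and well formed
alias QC_QUAL_LEN_OK = 2       # quality length == sequence length
alias QC_SEQ_ALPHABET_OK = 4   # sequence contains only expected characters


fn qc_passed(flags: Int, mask: Int) -> Bool:
    # True if every check in mask was already done by the converter.
    return (flags & mask) == mask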

4: Better buffer allocation/sizing.

Still learning about buffers, but I feel there is a better way to make use of this format.

5: Indexing

Indexing could probably give an easy parallel speedup, but I think this would be better handled at the gzip block decompression level.

6: Metadata

I'm sure there are optimizations we can do knowing the size of every record and line throughout the entire file. Where would we store this info, though? An optional index file?

@Trecek Trecek requested a review from sstadick April 21, 2025 18:12
@sstadick (Contributor) commented

Woohoo! The first draft is the hard part!

I left some comments, specifically around the removal of newlines after the fact, which I suspect is the thing actually holding the times back (maybe also number parsing though).

Totally agree with your optimization ideas: making the number parsing fast should be done, we should skip the header after the first time, and we can make validation an optional parameter so that, aside from checking we got the right number of bytes, that's all we do on the fast path.

var need = disk # local copy; will be mutated
self.seq.clear()
self.seq.reserve(need)
var got = self.reader.read_bytes(self.seq, need)
@sstadick (Contributor) commented on this diff:

I think this would be faster if you did a loop of memcpy instead of stripping the newlines after the fact. Maybe we need to adjust the format a bit to make that easy? Like use "bytes per line", "lines", "last_line_bytes" or something.

That way we only have to touch each byte one time, to copy it into the list.

@sstadick (Contributor) commented:

Maybe we just need total bytes and bytes per line.

var copied_bytes = 0
while copied_bytes < total_bytes:
    # Copy one sequence line (or the final partial line) straight into seq.
    copied_bytes += self.reader.read_bytes(self.seq, min(bytes_per_line, total_bytes - copied_bytes))
    # Consume the newline that follows each sequence line.
    var _newline = self.reader.read_byte()

var lcnt: Int = 0
var cur: Int = 0
var which: Int = 0
for i in range(1, pos2):
@sstadick (Contributor) commented on this diff:

I think you can save the swap here by doing a loop per variable, which should be fine. And then it can be a single method you can optimize to death (in a good way :) ).
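Roughly what I have in mind (untested, names made up):

from memory import UnsafePointer


# Parse one fixed-width ASCII field; call it once per header variable.
fn parse_fixed_field(p: UnsafePointer[UInt8], start: Int, width: Int) -> Int:
    var value = 0
    for i in range(start, start + width):
        value = value * 10 + (Int(p[i]) - 48)  # 48 == ord("0")
    return value

Then slen, bpl, and lcnt are each a single call with their own offset, there is no which/cur state to juggle, and this is the one place the SWAR decode can slot in later.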

@Trecek Trecek changed the title from "[BFXINT-906] Implement fastxpp" to "Test fastxpp implmentations" on Apr 24, 2025