
Conversation

@Trecek Trecek commented Apr 21, 2025

This is a PR to test the fastxpp format. It is not for merging in, just for reviewing the approach and path forward.

File changes

generate_fastxpp.mojo

This converts FASTA format to fastxpp format. I have not verified that it works for FASTQ yet;
I am only working with FASTA for now.
I just used the first approach that worked. I did not optimize or benchmark it yet, as the exact format may change.

kseq.mojo

The original fastx read method was kept to make benchmarking easier.
read_fastxpp was added as a method on the fastx reader struct.

There is an optimization that removes the newline characters in place. I haven't tested how big an impact it makes; I just wanted to try it because it sounded cool.
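Roughly, the idea is the following (just a sketch of the technique, not the exact code in kseq.mojo; the function name and raw-pointer signature are illustrative):

from memory import UnsafePointer


# Sketch only: drop newline bytes from a sequence buffer in place.
# Walks the buffer once with separate read/write cursors and returns the
# new logical length; bytes past that point are stale.
fn compact_newlines(buf: UnsafePointer[UInt8], n: Int) -> Int:
    var write = 0
    for read in range(n):
        var b = buf[read]
        if b != 0x0A:  # 0x0A == "\n"
            buf[write] = b
            write += 1
    return write

The compaction means each sequence byte gets touched twice (once by the read, once here), which is what the swar and read_once variants described below try to avoid.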

fastxpp_bench.mojo

Right now I am just using hyperfine to benchmark with this helper file. It takes the read method to use as a parameter, so I can invoke hyperfine like this:

hyperfine --warmup 5 -r 50 \
    'mojo run fastxpp_bench.mojo /analysis/git_projects/ish_fastx/data/uniprot_sprot.fastxpp orig' \
    'mojo run fastxpp_bench.mojo /analysis/git_projects/ish_fastx/data/uniprot_sprot.fastxpp fastxpp'

Benchmarking results

Used hyperfine --warmup 3 -r 20

orig

  • Original read method that uses no header comment info

strip_newline

  • Uses header info; does not use bpl, only slen and lcnt
  • SWAR decode
  • In-place compaction to remove newlines

swar

  • Uses header info: lcnt, bpl, slen
  • SWAR decode
  • Uses bytes per line (bpl) to remove newlines, doing a memcpy per line and jumping over the newline bytes

read_once

  • Uses header info, but does not need lcnt; only uses slen and bpl.
  • SWAR decode
  • Uses bpl to read bytes up to each newline, then consumes the newline with read_byte.

If/else dispatch order for this run: orig | strip_newline | fastxpp_swar | read_once
(benchmark screenshot)

Benchmark 1: ./fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x orig
  Time (mean ± σ):      1.560 s ±  0.011 s    [User: 1.246 s, System: 0.293 s]
  Range (min … max):    1.548 s …  1.594 s    20 runs
 
Benchmark 2: ./fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x strip_newline
  Time (mean ± σ):      1.303 s ±  0.012 s    [User: 0.995 s, System: 0.288 s]
  Range (min … max):    1.287 s …  1.337 s    20 runs
 
Benchmark 3: ./fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x swar
  Time (mean ± σ):      1.335 s ±  0.012 s    [User: 1.015 s, System: 0.299 s]
  Range (min … max):    1.317 s …  1.354 s    20 runs
 
Benchmark 4: ./fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x read_once
  Time (mean ± σ):      1.267 s ±  0.009 s    [User: 0.942 s, System: 0.304 s]
  Range (min … max):    1.251 s …  1.286 s    20 runs
 
Summary
  ./fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x read_once ran
    1.03 ± 0.01 times faster than ./fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x strip_newline
    1.05 ± 0.01 times faster than ./fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x swar
    1.23 ± 0.01 times faster than ./fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x orig

RESULTS CHANGE WITH IF/ELSE ORDER

Order of if/else in this test: orig | strip_newline | read_once | fastxpp_swar
(benchmark screenshot)

read_once method, maybe

It just narrowly edges out strip_newline.
This method does not use lcnt, which is great because that field is 7 digits.
Using this method, we can reduce the comment field size.

Benchmarking notes

The input FASTA was concatenated (cat) together 10x, bringing it from an uncompressed size of 256 MB to 2.6 GB, more in line with real-world data sizes (still small).

I benchmarked on an EC2 Ubuntu instance with 1000 MB/s throughput. That is important because our default EC2 throughput setting is 125 MB/s, which is far slower than the slowest SSD on the market.

Previously I was benchmarking with 'mojo run', but this led to differences due to compile time. Now the binary is built beforehand, which changed some of the bench results.

SWAR decode had a big impact on time, so it was included in each fastxpp variant read method.

hyperfine --warmup 2 -r 10 \
    './fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x orig' \
    './fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x strip_newline' \
    './fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x swar' \
    './fastxpp_bench uniprot_sprot.fastxpp_swar_nohlen_10x read_once'

Optimizations & ideas:

1: Better ASCII number parsing (SWAR decode)

I think ASCII number parsing is relatively expensive. I'm not certain, but there is a better ASCII parsing method worth trying:
fixed-width header info + SWAR. It doesn't seem like it would be difficult to implement SWAR (SIMD within a register): https://lemire.me/blog/2022/01/21/swar-explained-parsing-eight-digits/
The same optimization was used in https://lemire.me/blog/2021/01/29/number-parsing-at-a-gigabyte-per-second/
Right now I parse the ASCII the long way mentioned in the article.
The SWAR method, combined with fixed-size fields like 0000 0000 0000, would avoid having to handle separators.
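For reference, the eight-digit SWAR parse from the post would look roughly like this in Mojo (sketch only; the name, pointer-based signature, and eight-digit field width are my assumptions, not code in this PR):

from memory import UnsafePointer


# Sketch only: parse exactly eight ASCII digits starting at p, SWAR style.
fn parse_eight_digits_swar(p: UnsafePointer[UInt8]) -> UInt64:
    # Pack the eight bytes into one little-endian 64-bit word.
    var v: UInt64 = 0
    for i in range(8):
        v |= p[i].cast[DType.uint64]() << UInt64(8 * i)
    # Turn ASCII codes into digit values in every byte lane.
    v -= 0x3030303030303030
    # Merge adjacent lanes: 2-digit, then 4-digit, then 8-digit values.
    var pairs = ((v * ((10 << 8) + 1)) >> 8) & 0x00FF00FF00FF00FF
    var quads = ((pairs * ((100 << 16) + 1)) >> 16) & 0x0000FFFF0000FFFF
    return ((quads * ((10000 << 32) + 1)) >> 32)

A real implementation would replace the packing loop with a single unaligned 8-byte load. With every field padded to a fixed eight digits there is nothing to scan for: each header number becomes one unconditional call at a known offset.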

2: Better record header parsing

We should only need to parse for the >/@ header marker once. Since we already did a single pass over the file (to generate the fastxpp header info), we can either impute the offset of the next record start or store it in the previous record's header info, which avoids checks for empty lines and record boundaries.
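Rough sketch of the offset arithmetic (FASTA case only; header_end and the function name are made up for illustration):

# Sketch only: compute where the next ">" should start instead of scanning.
# header_end = offset just past the newline ending this record's header line.
# slen = sequence bytes, lcnt = number of sequence lines (one newline each).
fn next_record_offset(header_end: Int, slen: Int, lcnt: Int) -> Int:
    return header_end + slen + lcnt

For FASTQ the same idea would also have to account for the "+" line and the quality bytes.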

3: Checks up front

Since we are already parsing the fastx file to generate the header size info, we could do all the QC checks up front and include a flag field recording which checks were performed. This can save a lot of time on checks like validating the + line or that quality length == sequence length.
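A rough sketch of what the flag field could look like (all names and bit assignments here are made up): the converter sets a bit for each check it already performed, and the reader just tests bits instead of re-validating.

# Sketch only: hypothetical QC flag bits, stored as a small integer field.
alias QC_PLUS_LINE_OK = 1      # "+" separator line present and well formed
alias QC_QUAL_LEN_OK = 2       # quality length == sequence length
alias QC_SEQ_ALPHABET_OK = 4   # sequence contains only expected characters


fn qc_passed(flags: Int, mask: Int) -> Bool:
    # True if every check in mask was already done by the converter.
    return (flags & mask) == mask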

4: Better buffer allocation/sizing.

Still learning about buffers, but I feel there is a better way to make use of this format.

5: Indexing

Indexing could probably give an easy parallel speedup, but I think this would be better handled at the gzip block decompression level.

6: Metadata

I'm sure there are optimizations we can do knowing the size of every record and line throughout the entire file. Where would we store this info, though? An optional index file?

@Trecek Trecek requested a review from sstadick April 21, 2025 18:12
@sstadick (Contributor) commented

Woohoo! The first draft is the hard part!

I left some comments, specifically around the removal of newlines after the fact, which I suspect is the thing actually holding the times back (maybe also number parsing though).

Totally agree with your optimization ideas: making the number parsing fast should be done, we should skip the header after the first time, and we can make validation an optional parameter so that, aside from checking we got the right number of bytes, that's all we do on the fast path.

var need = disk # local copy; will be mutated
self.seq.clear()
self.seq.reserve(need)
var got = self.reader.read_bytes(self.seq, need)
@sstadick (Contributor) commented on this diff:

I think this would be faster if you did a loop of memcpy instead of stripping the newlines after the fact. Maybe we need to adjust the format a bit to make that easy? Like use "bytes per line", "lines", "last_line_bytes" or something.

That way we only have to touch each byte one time, to copy it into the list.

@sstadick (Contributor) commented:

Maybe we just need total bytes and bytes per line.

var copied_bytes = 0
while copied_bytes < total_bytes:
    # Copy one sequence line (or the final partial line) straight into seq.
    copied_bytes += self.reader.read_bytes(self.seq, min(bytes_per_line, total_bytes - copied_bytes))
    # Consume the newline that follows each sequence line.
    var _newline = self.reader.read_byte()

var lcnt: Int = 0
var cur: Int = 0
var which: Int = 0
for i in range(1, pos2):
@sstadick (Contributor) commented on this diff:

I think you can save the swap here by doing a loop per variable, which should be fine. And then it can be a single method you can optimize to death (in a good way :) ).
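Roughly what I have in mind (untested, names made up):

from memory import UnsafePointer


# Parse one fixed-width ASCII field; call it once per header variable.
fn parse_fixed_field(p: UnsafePointer[UInt8], start: Int, width: Int) -> Int:
    var value = 0
    for i in range(start, start + width):
        value = value * 10 + (Int(p[i]) - 48)  # 48 == ord("0")
    return value

Then slen, bpl, and lcnt are each a single call with their own offset, there is no which/cur state to juggle, and this is the one place the SWAR decode can slot in later.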

@Trecek Trecek changed the title from "[BFXINT-906] Implement fastxpp" to "Test fastxpp implmentations" on Apr 24, 2025