Test fastxpp implementations #14
Conversation
Woohoo! The first draft is the hard part! I left some comments, specifically around the removal of newlines after the fact, which I suspect is the thing actually holding the times back (maybe also number parsing). Totally agree with your optimization ideas: number parsing should be made fast, we should skip the header after the first time, and we can make validation an optional parameter so that, outside of checking we got the right number of bytes, that's all we do on the fast path.
ishlib/vendor/kseq.mojo
var need = disk  # local copy; will be mutated
self.seq.clear()
self.seq.reserve(need)
var got = self.reader.read_bytes(self.seq, need)
I think this would be faster if you did a loop of memcpy instead of stripping the newlines after the fact. Maybe we need to adjust the format a bit to make that easy? Like use "bytes per line", "lines", "last_line_bytes" or something.
That way we only have to touch each byte one time, to copy it into the list.
Maybe we just need total bytes and bytes per line.
var copied_bytes = 0
while copied_bytes < total_bytes:
    copied_bytes += self.reader.read_bytes(self.seq, min(bytes_per_line, total_bytes - copied_bytes))
    var _newline = self.reader.read_byte()
ishlib/vendor/kseq.mojo
var lcnt: Int = 0
var cur: Int = 0
var which: Int = 0
for i in range(1, pos2):
I think you can save the swap here by doing a loop per variable, which should be fine. And then it can be a single method you can optimize to death (in a good way :) ).
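For illustration only, a minimal sketch of the loop-per-variable idea in Mojo; `next_field`, `buf`, and the field order are placeholder assumptions, not the actual kseq.mojo code:

# Hypothetical helper: scan forward to the next separator byte. Calling it once
# per field replaces the single loop that swaps `which` between fields.
fn next_field(buf: List[UInt8], start: Int, sep: UInt8) -> Int:
    var i = start
    while i < len(buf) and buf[i] != sep:
        i += 1
    return i  # index of the separator (or end of buffer)

# Caller sketch: one pass per field, no `which` bookkeeping.
# var name_end = next_field(buf, 1, UInt8(ord(" ")))
# var comment_end = next_field(buf, name_end + 1, UInt8(ord(" ")))
# var len_end = next_field(buf, comment_end + 1, UInt8(ord("\n")))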
This is a PR to test the fastxpp format. It is not for merging in, just for reviewing the approach and the path forward.
File changes
generate_fastxpp.mojo
This converts fasta format to fastxpp format. I have not verified it works for fastq yet.
I am only working with fasta for now.
I just used what worked first. I did not optimize or benchmark yet, as the exact format may change.
kseq.mojo
The original fastx read method was kept to make benchmarking easier.
read_fastxpp was added as a method on the fastx reader struct.
There is an optimization for removing the newline characters in place (a sketch of the idea follows). I haven't tested how big of an impact it makes; I just wanted to try it because it sounded cool.
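As a rough illustration of what "in place" means here (a sketch only; the function name, the `mut` convention, and the List-based buffer are assumptions, not the actual kseq.mojo code):

# Hypothetical compaction pass: copy non-newline bytes toward the front,
# then shrink the buffer, so each byte is touched once and nothing is re-allocated.
fn strip_newlines_in_place(mut buf: List[UInt8]) -> Int:
    var nl = UInt8(ord("\n"))
    var w = 0  # write cursor
    for r in range(len(buf)):  # read cursor
        if buf[r] != nl:
            buf[w] = buf[r]
            w += 1
    buf.resize(w, 0)  # drop the stale tail left after compaction
    return w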
fastxpp_bench.mojo
Right now I am just using hyperfine to benchmark with this helper file. It takes the read method to use as a param so I can use hyperfine like this:
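(Illustrative placeholder: the `./fastxpp_bench` binary name and the method arguments below are assumptions, not the exact command.)
hyperfine --warmup 3 -r 20 './fastxpp_bench orig' './fastxpp_bench strip_newline' './fastxpp_bench swar' './fastxpp_bench read_once'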
Benchmarking results
Used hyperfine --warmup 3 -r 20 to compare the read methods: orig, strip_newline, swar, read_once.
[hyperfine screenshot: if/else order orig | strip_newline | fastxpp_swar | read_once]
RESULTS CHANGE WITH IF ELSE ORDER
[hyperfine screenshot: if/else order orig | strip_newline | read_once | fastxpp_swar]
read_once method, maybe
It just narrowly edges out strip_newline.
This method does not use lcnt, which is great because that field is 7 digits.
Using this method, we can reduce the comment field size.
Benchmarking notes
The input fasta was cat'd together 10x, bringing it from an uncompressed size of 256 MB to 2.6 GB, more in line with real-world data sizes (still small).
I benchmarked on an EC2 Ubuntu instance with 1000 MB/s disk throughput. That matters because our default EC2 throughput setting is 125 MB/s, which is far slower than the slowest SSD on the market.
Previously I was benchmarking with `mojo run`, but this led to differences due to compile time. Now the binary is built beforehand, which changed some of the benchmarks.
SWAR decode had a big impact on time, so it was included in each fastxpp variant read method.
Optimizations & ideas:
1: Better ASCII number parsing (SWAR decode)
I think ASCII number parsing is relatively expensive. I'm not certain, but there is a better ASCII parsing method worth trying.
Fixed-width header info + SWAR. It doesn't seem like it would be difficult to implement SWAR (SIMD within a register): https://lemire.me/blog/2022/01/21/swar-explained-parsing-eight-digits/
This optimization was used for https://lemire.me/blog/2021/01/29/number-parsing-at-a-gigabyte-per-second/
Right now I parse ASCII the long way mentioned in the article.
The SWAR method, plus fixed-size fields like 0000 0000 0000, would avoid having to handle separators; a sketch follows.
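For reference, a sketch of the eight-digit SWAR parse from the linked post ported to Mojo; it assumes the 8 ASCII digit bytes have already been loaded little-endian into a UInt64 (the load itself is omitted) and the function name is a placeholder:

# Each byte of `raw` holds one ASCII digit, first digit in the lowest byte.
fn parse_eight_digits_swar(raw: UInt64) -> UInt64:
    var val = raw - 0x3030303030303030    # ASCII '0'..'9' -> 0..9 in every byte
    val = (val * 10) + (val >> 8)         # byte i becomes 10*d[i] + d[i+1]
    var mask: UInt64 = 0x000000FF000000FF
    var mul1: UInt64 = 0x000F424000000064  # 100 + (1000000 << 32)
    var mul2: UInt64 = 0x0000271000000001  # 1 + (10000 << 32)
    # combine the digit pairs two multiplies at a time; the result lands in the top 32 bits
    return (((val & mask) * mul1) + (((val >> 16) & mask) * mul2)) >> 32

With fixed, zero-padded 8-digit fields this runs with no branching and no separator handling.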
2: Better record header parsing
We should only need to parse for the >/@ header once. If we already did a single pass, we can either impute the offset of the next record start or store it in the previous record's header info, avoiding checks for empty lines and record boundaries; rough arithmetic is sketched below.
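A rough sketch of the offset imputation, assuming the header stores total sequence bytes and bytes per line as discussed above (names are placeholders):

# Hypothetical: where the next record's header starts, given this record's start,
# a fixed-width header, and newline-terminated sequence lines.
fn next_record_offset(record_start: Int, header_len: Int, total_bytes: Int, bytes_per_line: Int) -> Int:
    var lines = (total_bytes + bytes_per_line - 1) // bytes_per_line  # ceil divide
    return record_start + header_len + total_bytes + lines  # one newline per line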
3: Checks up front
If we are parsing the fastx file to generate the header size info anyway, we could do all the QC checks up front and include a flag field recording which checks were performed. This can save a lot of time checking things like the + line or that quality length == sequence length.
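For example, the flag field could be a small bitmask written at conversion time (a sketch; the bit names and values are assumptions):

# Hypothetical QC bits recorded in the fastxpp header when the file is generated.
alias QC_PLUS_LINE_OK: UInt8 = 0x01       # fastq '+' separator line verified
alias QC_QUAL_LEN_OK: UInt8 = 0x02        # quality length == sequence length verified
alias QC_SEQ_ALPHABET_OK: UInt8 = 0x04    # sequence bytes checked against the expected alphabet
# A reader can then skip any check whose bit is already set.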
4: Better buffer allocation/sizing.
Still learning about buffers, but I feel there is a better way to make use of this format.
5: Indexing
Indexing could probably give an easy parallel speedup, but I think this would be better handled at the gzip block decompression level.
6: Metadata
I'm sure there are optimizations we can do knowing the size of every record and line throughout the entire file. Where would we store this info, though? An optional index file?