https://github.com/cesarb/blake2-rfc
I measure it at about 2% faster than portable.rs. I'm not yet sure why; it might be using some SIMD under the covers, or the compiler might be optimizing it to SSE2.
However, the relationship reverses if I set RUSTFLAGS="-C target-cpu=native -C target-feature=-avx2", and I have no idea why. The difference is still small either way. Notably, performance tanks for both implementations if I allow them to use AVX2.
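
One way to narrow this down might be to print which SIMD features the compiler was allowed to assume for a given build. Below is a minimal sketch using only the standard library; the function name `compile_time_features` is made up for illustration and is not part of blake2-rfc or any other crate. It only reports what rustc was permitted to use, not whether the hashing loop was actually vectorized.

```rust
/// Returns (feature name, enabled at compile time) pairs.
/// `cfg!(target_feature = "...")` is resolved by rustc from the target
/// defaults plus any `-C target-cpu` / `-C target-feature` flags.
fn compile_time_features() -> Vec<(&'static str, bool)> {
    vec![
        ("sse2", cfg!(target_feature = "sse2")),
        ("ssse3", cfg!(target_feature = "ssse3")),
        ("sse4.1", cfg!(target_feature = "sse4.1")),
        ("avx", cfg!(target_feature = "avx")),
        ("avx2", cfg!(target_feature = "avx2")),
    ]
}

fn main() {
    // Print one line per feature so two builds can be compared directly.
    for (name, enabled) in compile_time_features() {
        let state = if enabled { "enabled" } else { "disabled" };
        println!("{}: {}", name, state);
    }
}
```

Running this once with no RUSTFLAGS and once with RUSTFLAGS="-C target-cpu=native -C target-feature=-avx2" shows what differs between the two builds. On x86_64, SSE2 is part of the baseline target, so it is enabled even with no flags, which is at least consistent with the compiler using SSE2 on its own.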