Use muladd in matmul and improved operation order. #818
chriselrod wants to merge 3 commits into JuliaArrays:master
Conversation
Overall it is an improvement. But if you actually want it to be fast, it needs SIMD intrinsics.
This partially conflicts with #814. Did you look at my PR? It is more or less complete now; I'm just working on slightly better heuristics for picking the right multiplication method. Now I'm wondering if I should integrate this PR first.
I can definitely change my PR to use `muladd`. In any case, I will merge your branch into mine.
The reordering was supposed to make things a little more similar to PaddedMatrices, which is far faster at most sizes: https://github.com/chriselrod/PaddedMatrices.jl. But good idea to test the three different implementations.

Skylake-X (AVX512): (plot)

Worth pointing out that all of these are done on the same computer, by starting Julia with different CPU targets.

Script:

```julia
using StaticArrays, BenchmarkTools, Plots; plotly();

function bench_muls(sizerange)
    res = Matrix{Float64}(undef, length(sizerange), 3)
    for (i, r) ∈ enumerate(sizerange)
        A = @SMatrix rand(r, r)
        B = @SMatrix rand(r, r)
        res[i, 1] = @belapsed StaticArrays.mul_unrolled($(Size(A)), $(Size(B)), $(Ref(A))[], $(Ref(B))[])
        res[i, 2] = @belapsed StaticArrays.mul_unrolled_chunks($(Size(A)), $(Size(B)), $(Ref(A))[], $(Ref(B))[])
        res[i, 3] = @belapsed StaticArrays.mul_loop($(Size(A)), $(Size(B)), $(Ref(A))[], $(Ref(B))[])
    end
    res
end

benchtimes = bench_muls(1:24);
plot(1:24, 2e-9 .* (1:24) .^ 3 ./ benchtimes, labels = ["mul_unrolled" "mul_unrolled_chunks" "mul_loop"]);
xlabel!("Matrix Dimensions"); ylabel!("GFLOPS")
```
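For readers unfamiliar with what the PR title means in practice, the computation that the loop-based method unrolls can be sketched as a plain `muladd` kernel. This is an illustrative sketch only, not the PR's actual generated code, and the function name here is hypothetical:

```julia
# Illustrative sketch of a muladd-based matmul kernel; NOT the PR's
# generated code, just the shape of the computation it unrolls.
# Using muladd lets LLVM emit fused multiply-add (vfmadd) instructions
# where the hardware supports them.
function matmul_muladd(A::Matrix{Float64}, B::Matrix{Float64})
    m, k = size(A)
    k2, n = size(B)
    @assert k == k2
    C = zeros(m, n)
    @inbounds for j in 1:n, p in 1:k
        b = B[p, j]
        for i in 1:m
            # accumulate C[i,j] += A[i,p] * b with a single fused op
            C[i, j] = muladd(A[i, p], b, C[i, j])
        end
    end
    C
end
```

The column-major loop order (j outer, i inner) keeps the innermost loop contiguous in memory, which is the kind of operation-order improvement the PR aims for.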
Hmm... I'll have to re-run my benchmarks with
Also, worth pointing out that there is a difference between running a recent CPU with a reduced instruction-set target and real Haswell. On Skylake, addition's throughput was doubled, so you only have half instead of 1/3. Meaning real Haswell should benefit a lot from muladd.
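As a minimal illustration of the instruction-level point above (not code from the PR): `muladd(a, b, c)` gives the compiler license to fuse the multiply and add into one FMA instruction, whereas `a * b + c` as written must lower to a separate multiply and add unless the optimizer can prove fusion is allowed:

```julia
# muladd(a, b, c) computes a*b + c, allowing (but not requiring) fusion
# into a single FMA instruction on hardware that has it.
a, b, c = 2.0, 3.0, 1.0
@assert muladd(a, b, c) == a * b + c == 7.0

# fma() requires the fused operation (a single rounding step); with FMA
# hardware, muladd typically lowers to the same instruction.
@assert fma(a, b, c) == 7.0
```

On chips where the FMA unit and the adder have different throughputs, replacing separate multiply/add pairs with `muladd` changes which execution ports the matmul kernel saturates, which is why the gains differ between Haswell and Skylake.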
Performance should generally be better under this PR for `Float64` inputs given AVX(2). Here, I ran the benchmark in different Julia sessions and then `println`ed the results from one. Benefits are minor with AVX512, but I'll run AVX2 benchmarks too:
