using DoubleFloats, BenchmarkTools, LinearAlgebra
const n = 1000
A = randn(Double64, n, n)
B = randn(Double64, n, n)
C = zeros(Double64, n, n)
@btime mul!($C, $A, $B);
gives
21.144 s (0 allocations: 0 bytes)
while
@btime mul!($C, $A, $B, true, false);
gives
13.402 s (3 allocations: 29.81 KiB)
Shouldn't mul!(C,A,B) be implemented as mul!(C,A,B,true,false)? If so, why do the two calls differ in run time and allocations?
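One way to narrow this down might be to check which method each call dispatches to and to confirm that both produce the same product. The following is only a sketch: it uses `@which` from the `InteractiveUtils` standard library, a smaller matrix size chosen here so it runs quickly, and the illustrative names `C3`, `C5`, `m3`, `m5`, and `same`, none of which come from the benchmark above. It does not by itself explain the timing gap.

```julia
using DoubleFloats, LinearAlgebra, InteractiveUtils

n_small = 50                      # assumption: small size so the check is fast
A = randn(Double64, n_small, n_small)
B = randn(Double64, n_small, n_small)
C3 = zeros(Double64, n_small, n_small)   # destination for the 3-argument call
C5 = zeros(Double64, n_small, n_small)   # destination for the 5-argument call

# Ask Julia which method each call would dispatch to.
m3 = @which mul!(C3, A, B)
m5 = @which mul!(C5, A, B, true, false)
println(m3)
println(m5)

# Run both variants and compare the results up to rounding.
mul!(C3, A, B)
mul!(C5, A, B, true, false)
same = C3 ≈ C5
println("results agree: ", same)
```

If the two printed methods live in different files or packages, that would suggest the 3-argument call is not simply forwarding to the 5-argument one for these element types, which could account for different code paths and allocation behavior.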