Haswell

The Haswell CPU benchmarked here is a 1.7 GHz laptop CPU. It features two 256-bit FMA units, giving it a peak FLOPS/cycle comparable to Tigerlake. However, it has smaller caches, fewer and smaller registers (so data has to be cycled through the cache more quickly), and more limited out-of-order capabilities, all of which make it much more difficult to achieve peak performance on Haswell.
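
As a rough sketch of the arithmetic behind that peak figure, assume double precision (4 Float64 lanes per 256-bit vector), count each FMA as 2 floating point operations, and assume the core actually sustains its 1.7 GHz clock (turbo and throttling are ignored):

```math
\underbrace{2}_{\text{FMA units}} \times \underbrace{4}_{\text{Float64 lanes}} \times \underbrace{2}_{\text{FLOP per FMA}} = 16\ \text{FLOP/cycle},
\qquad
16\ \text{FLOP/cycle} \times 1.7\ \text{GHz} = 27.2\ \text{GFLOPS per core}
```

Under the same assumptions, a core with a single 512-bit FMA unit would reach 1 × 8 × 2 = 16 FLOP/cycle as well, which is the sense in which the two chips are comparable per cycle. That the Tigerlake chip has this configuration is an assumption here, not something stated above.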

Statically sized benchmarks vs StaticArrays.jl (figure: sizedbenchmarks)

SMatrix and MMatrix are the immutable and mutable matrix types from StaticArrays.jl, respectively, while StrideArray and PtrArray are mutable array types with optional static sizing provided by PaddedMatrices.jl. The benchmarks also include jmul! on Base Matrix{Float64}, demonstrating the performance of PaddedMatrices's fully dynamic multiplication function.

SMatrix was only benchmarked up to size 20x20. Since its performance at larger sizes has recently improved, I'll extend the size range over which I benchmark it in the future.

The fully dynamic multiplication is competitive with MKL and OpenBLAS from around 2x2 to 256x256 (figures: dgemmbenchmarkssmall, dgemmbenchmarksmedium).

Benchmarks will be added later.