Tigerlake

Tigerlake CPUs feature just a single 512-bit FMA unit, so their theoretical peak FLOPS are comparable to those of AVX2 CPUs featuring two 256-bit FMA units, such as Intel's Skylake or AMD's Zen2. The much larger register file that AVX512 provides, combined with Tigerlake's comparatively much larger L1 and L2 caches (and no doubt helped by the large out-of-order buffer), makes it comparatively very easy to attain near-peak performance on Tigerlake.
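As a rough check on the "comparable peak" claim, the per-core, per-cycle arithmetic can be written out. This sketch assumes double precision (64-bit elements), counts each fused multiply-add as 2 floating point operations, and ignores clock speed differences between the chips:

$$
\begin{aligned}
\text{one 512-bit FMA unit:}\quad & 1 \times \frac{512}{64} \times 2 = 16 \ \text{FLOP/cycle} \\
\text{two 256-bit FMA units:}\quad & 2 \times \frac{256}{64} \times 2 = 16 \ \text{FLOP/cycle}
\end{aligned}
$$

Under these assumptions the two designs have the same theoretical peak per cycle, so any difference in achieved performance comes from how well the rest of the core keeps the FMA units fed.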

Statically sized benchmarks vs StaticArrays.jl: sizedbenchmarks

SMatrix and MMatrix are the immutable and mutable matrix types from StaticArrays.jl, respectively, while StrideArray.jl and PtrArray.jl are mutable array types with optional static sizing provided by PaddedMatrices.jl. The benchmarks also include jmul! on base Matrix{Float64}, demonstrating the performance of PaddedMatrices's fully dynamic multiplication function.

The version of OpenBLAS used (0.3.10) didn't support Tigerlake yet. Unlike Cascadelake, where approaching the CPU's peak performance can be challenging, it is easy with Tigerlake: Tigerlake has much larger caches and reorder buffers, making it much more capable of feeding its execution units, while having only half as many of them to feed as Cascadelake for these workloads (1 FMA unit vs. 2 FMA units).

Threaded results of the dynamic matmul: threadedbenchmarks

Single threaded, the fully dynamic multiplication is competitive with MKL and OpenBLAS from around 2x2 to 256x256: dgemmbenchmarkssmall

Unlike on the Cascadelake CPU, it was able to keep pace with MKL at least through 2000x2000: dgemmbenchmarksmedium