PaddedMatrices.jl

Manual Outline

PaddedMatrices.jmul! - Method

jmul!(C, A, B[, α = 1, β = 0])

Calculates C = α * (A * B) + β * C in place.
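As a minimal sketch of the update rule above, the following uses the five-argument `mul!` from Julia's standard `LinearAlgebra` library, which computes the same `C = α * (A * B) + β * C` in place. It is a stand-in used only so the example runs without PaddedMatrices installed; the matrices and scalars are made-up values.

```julia
using LinearAlgebra

# Illustrative inputs (not taken from the PaddedMatrices documentation).
A = [1.0 2.0; 3.0 4.0]
B = [5.0 6.0; 7.0 8.0]
C = ones(2, 2)
α, β = 2.0, 3.0

# Out-of-place reference result, computed before C is overwritten.
expected = α * (A * B) + β * C

# In-place update using the standard library; C now holds the result.
mul!(C, A, B, α, β)

# Going by the signature documented above, the PaddedMatrices call
# would presumably be written as: jmul!(C, A, B, α, β)
```

With these inputs, `C` ends up equal to `[41.0 47.0; 89.0 103.0]`. Leaving `α` and `β` at their defaults of 1 and 0 reduces the operation to a plain `C = A * B`.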

A single-threaded matrix-matrix multiply implementation. Supports both dynamically and statically sized arrays.

Organizationally, jmul! checks the arrays' properties and tries to dispatch to an appropriate implementation. If the arrays are small and statically sized, it dispatches to an inlined multiply.

Otherwise, based on the arrays' sizes, whether they are transposed, and whether their columns are already aligned, it decides whether to skip packing entirely, to pack only A, or to pack both A and B.
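The three-way decision described above could be sketched roughly as follows. Everything in this block is hypothetical: the `PackStrategy` enum, the `choose_packing` function, its keyword arguments, and both thresholds are invented for illustration and are not the heuristics PaddedMatrices actually uses.

```julia
# Hypothetical sketch: names and thresholds are illustrative assumptions,
# not values taken from PaddedMatrices.
@enum PackStrategy NoPacking PackA PackAB

function choose_packing(M::Int, K::Int, N::Int;
                        transposed_A::Bool = false,
                        aligned_columns::Bool = true,
                        small_limit::Int = 64^3,    # assumed cutoff
                        large_limit::Int = 512^3)   # assumed cutoff
    work = M * K * N  # rough proxy for problem size
    if work <= small_limit && !transposed_A && aligned_columns
        # Small, well-laid-out inputs: packing overhead is not worth paying.
        return NoPacking
    elseif work >= large_limit
        # Large inputs: pack both operands for better cache behavior.
        return PackAB
    else
        # In between, or awkward layout of A: pack only A.
        return PackA
    end
end
```

For example, under these assumed cutoffs `choose_packing(8, 8, 8)` returns `NoPacking`, while `choose_packing(1024, 1024, 1024)` returns `PackAB`.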

PaddedMatrices.jmulpackAB! - Method

Packs both arrays A and B. Primitively packs both A and B into column-major temporaries.

Column-major B is preferred over row-major because, without packing, the stride across k iterations of B becomes excessive, and unless nᵣ is a multiple of the cache-line size, we would not use 100% of each loaded cache line. Unfortunately, using column-major B does mean that we are starved for integer registers within the macrokernel.
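To illustrate what packing into a column-major temporary means, here is a minimal sketch in plain Julia. It copies a lazily transposed matrix (row-major in memory) into a freshly allocated `Matrix`, which Julia always stores column-major. It shows only the layout change; the blocking and buffer reuse that a real packing routine performs are not modeled, and the variable names are illustrative.

```julia
# A 3x4 column-major parent; its lazy transpose is a 4x3 matrix whose
# elements are laid out row-major in memory.
B_parent = reshape(collect(1.0:12.0), 3, 4)
B = transpose(B_parent)

# "Pack" B: copy it element by element into a contiguous
# column-major temporary of the same logical shape.
B_packed = Matrix{Float64}(undef, size(B))
B_packed .= B

# The packed copy has unit stride down each column, so consecutive
# rows within a column are adjacent in memory.
strides(B_packed)
```

Here `strides(B_packed)` is `(1, 4)`: moving down a column advances one element in memory, whereas in the unpacked transpose the same step jumps across a full column of the parent array.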

Once LoopVectorization adds a few features to make it easy to abstract away tile-major memory layouts, we will switch to those, probably improving performance for larger matrices.

PaddedMatrices.@gc_preserve - Macro

@gc_preserve foo(A, B, C)

Apply to a single, non-nested function call. It will GC.@preserve all the arguments and substitute suitable arrays with PtrArrays. This potentially allows statically sized mutable arrays to be both stack-allocated and passed across a non-inlined function boundary.
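For context, the sketch below shows the underlying Base mechanism, `GC.@preserve`, which keeps an object rooted while code works through a raw pointer to its memory. It does not use `@gc_preserve` or `PtrArray` themselves; `sum_via_pointer` is a hypothetical helper written only for this illustration.

```julia
# Hypothetical helper: sums a vector by reading through a raw pointer.
function sum_via_pointer(x::Vector{Float64})
    s = 0.0
    # Without GC.@preserve, the compiler may consider `x` dead once the
    # pointer has been taken, so the memory could be collected while
    # we are still reading from it.
    GC.@preserve x begin
        p = pointer(x)
        for i in 1:length(x)
            s += unsafe_load(p, i)  # 1-based offset from p
        end
    end
    return s
end

sum_via_pointer([1.0, 2.0, 3.0])
```

Going by the description above, `@gc_preserve foo(A, B, C)` presumably automates this pattern: it preserves every argument and passes pointer-based array wrappers to `foo` in their place.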
