PaddedMatrices.jl

Manual Outline

PaddedMatrices.jmulpackAB! - Method

Packs both arrays A and B. In its current, primitive form, it packs both A and B into column-major temporaries.

Column-major B is preferred over row-major for two reasons: without packing, the stride across k iterations of B becomes excessive; and unless nᵣ is a multiple of the cache line size, we would fail to use 100% of each loaded cache line. Unfortunately, using column-major B does mean that we are starved for integer registers within the macrokernel.

Once LoopVectorization adds a few features to make it easy to abstract away tile-major memory layouts, we will switch to those, probably improving performance for larger matrices.
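To make the idea of packing concrete, here is a minimal sketch of copying a block of a matrix into a dense column-major temporary. The helper name `pack_block` and the block ranges are hypothetical and chosen only for illustration; this is not the package's packing routine, which works on tuned block sizes.

```julia
# Hypothetical helper: copy the block M[rows, cols] into a dense
# column-major temporary, so later reads run with unit stride down each column.
function pack_block(M::AbstractMatrix{T}, rows::UnitRange{Int}, cols::UnitRange{Int}) where {T}
    P = Matrix{T}(undef, length(rows), length(cols))
    # Outer loop over columns, inner loop over rows: writes to P are contiguous.
    for (jp, j) in enumerate(cols), (ip, i) in enumerate(rows)
        P[ip, jp] = M[i, j]
    end
    return P
end

# A row-major B: the transpose wrapper of a 5x6 array is a 6x5 matrix whose
# consecutive elements within a column sit 5 apart in memory.
B = transpose(reshape(collect(1.0:30.0), 5, 6))
Bp = pack_block(B, 2:4, 1:3)   # dense 3x3 copy of B[2:4, 1:3]
```

After packing, `Bp` holds the same values as the block of `B`, but walking down a column touches adjacent memory locations instead of jumping by the parent array's leading dimension.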

PaddedMatrices.matmul! - Method

matmul!(C, A, B[, α, β, max_threads])

Calculates C = α * A * B + β * C in place, overwriting the contents of C. It may use up to max_threads threads. It will not use threads when nested in other threaded code.
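The update rule can be checked against the standard library. The sketch below uses the five-argument `LinearAlgebra.mul!`, not `matmul!` itself, on the assumption that both implement the formula stated above; the matrices and scalars are arbitrary example values.

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
B = [5.0 6.0; 7.0 8.0]
C = [1.0 1.0; 1.0 1.0]
α, β = 2.0, 3.0

expected = α * (A * B) + β * C   # out-of-place reference result
A_before = copy(A)               # kept to confirm that A is only read

# In-place update C = α * A * B + β * C; only C is overwritten.
mul!(C, A, B, α, β)
```

Only `C` changes: `A` and `B` are read-only inputs, and the old contents of `C` contribute through the β term.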

PaddedMatrices.matmul_serial! - Method

matmul_serial!(C, A, B[, α = 1, β = 0])

Calculates C = α * (A * B) + β * C in place.

A single-threaded matrix-matrix multiply implementation. Supports dynamically and statically sized arrays.

Organizationally, matmul_serial! checks the arrays' properties to try to dispatch to an appropriate implementation. If the arrays are small and statically sized, it will dispatch to an inlined multiply.

Otherwise, based on the arrays' sizes, whether they are transposed, and whether the columns are already aligned, it decides not to pack at all, to pack only A, or to pack both arrays A and B.
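The shape of that decision can be sketched as a small function. Everything here is hypothetical: the name `choose_strategy`, the returned symbols, and all thresholds are invented for illustration, and the sketch looks only at sizes, leaving out the transposition and alignment checks.

```julia
# Hypothetical decision function; the thresholds are made up for illustration
# and are not the values the package uses.
function choose_strategy(M::Int, K::Int, N::Int; static_sized::Bool=false,
                         small_limit::Int=16, a_limit::Int=512, b_limit::Int=4096)
    if static_sized && max(M, K, N) <= small_limit
        return :inline       # small, statically sized: inlined multiply
    elseif M * K <= a_limit
        return :no_pack      # A is small enough that packing would not pay off
    elseif K * N <= b_limit
        return :pack_A       # pack only A
    else
        return :pack_AB      # pack both A and B
    end
end

choose_strategy(8, 8, 8; static_sized=true)
choose_strategy(1000, 1000, 1000)
```

The ordering matters: cheaper strategies are tried first, and packing both arrays is the fallback for the largest problems.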

PaddedMatrices.reset_bcache_lock! - Method

reset_bcache_lock!()

The matmul routine currently does not use try/finally, even though it takes a lock. If it throws an error for some reason, the lock may remain held, and you may need to call reset_bcache_lock!() manually.
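The failure mode is easy to reproduce with a toy lock. In the sketch below, `DEMO_LOCK`, `try_acquire!`, `reset_demo_lock!`, `work_unguarded`, and `work_guarded` are all hypothetical stand-ins written with Base only; they are not the package's internals.

```julia
# Hypothetical stand-in for a global cache lock: 0 = free, 1 = held.
const DEMO_LOCK = Threads.Atomic{Int}(0)

# atomic_cas! returns the old value, so 0 means we acquired the lock.
try_acquire!() = Threads.atomic_cas!(DEMO_LOCK, 0, 1) == 0
reset_demo_lock!() = (DEMO_LOCK[] = 0; nothing)

# Without try/finally: an exception in f skips the release.
function work_unguarded(f)
    try_acquire!() || error("lock is already held")
    result = f()
    reset_demo_lock!()
    return result
end

# With try/finally: the lock is released even when f throws.
function work_guarded(f)
    try_acquire!() || error("lock is already held")
    try
        return f()
    finally
        reset_demo_lock!()
    end
end
```

If `f` throws inside `work_unguarded`, `DEMO_LOCK[]` stays at 1 until `reset_demo_lock!()` is called by hand, which mirrors the manual reset described above. The guarded variant avoids this.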

PaddedMatrices.@gc_preserve - Macro

@gc_preserve foo(A, B, C)

Apply to a single, non-nested function call. It will GC.@preserve all the arguments and substitute PtrArrays for suitable arrays. This has the benefit of potentially allowing statically sized mutable arrays to be both stack allocated and passed across a non-inlined function boundary.
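For readers unfamiliar with the underlying pattern, here is a hand-written sketch of preserving an array while working through a raw pointer. It uses only Base (`GC.@preserve`, `pointer`, `unsafe_load`) and no PtrArray; the function name `sum_via_pointer` is hypothetical, and the macro presumably automates something along these lines for every argument of the call.

```julia
# Sum a vector through a raw pointer. GC.@preserve keeps `v` rooted while the
# pointer is in use; a bare pointer alone does not keep the array alive.
function sum_via_pointer(v::Vector{Float64})
    s = 0.0
    GC.@preserve v begin
        p = pointer(v)
        for i in 1:length(v)
            s += unsafe_load(p, i)   # 1-based element offset from p
        end
    end
    return s
end

total = sum_via_pointer([1.0, 2.0, 3.0])
```

The pointer must not escape the `GC.@preserve` block, since the array is only guaranteed to stay alive inside it.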
