PaddedMatrices.jl
Manual Outline
PaddedMatrices.MULTASKS
PaddedMatrices.jmulpackAB!
PaddedMatrices.jmulpackAonly!
PaddedMatrices.matmul
PaddedMatrices.matmul!
PaddedMatrices.matmul_serial!
PaddedMatrices.reseet_bcache_lock!
PaddedMatrices.@gc_preserve
PaddedMatrices.MULTASKS
— Constant
Length is one less than Base.Threads.nthreads().
PaddedMatrices.jmulpackAB!
— Method
Packs both arrays A and B. Primitively packs both A and B into column-major temporaries.
Column-major B is preferred over row-major because, without packing, the stride across k iterations of B becomes excessive, and unless nᵣ is a multiple of the cacheline size, we would fail to make use of 100% of the loaded cachelines. Unfortunately, using column-major B does mean that we are starved for integer registers within the macrokernel.
Once LoopVectorization adds a few features to make it easy to abstract away tile-major memory layouts, we will switch to those, probably improving performance for larger matrices.
PaddedMatrices.jmulpackAonly!
— Method
Only packs A. Primitively does column-major packing: it packs blocks of A into a column-major temporary.
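As a rough illustration of what packing a block into a column-major temporary means, here is a minimal sketch in plain Julia. The helper `pack_block!` is hypothetical and is not part of PaddedMatrices; the package's real packing routines are blocked and vectorized, and this only shows the memory-layout idea.

```julia
# Hypothetical helper, not part of PaddedMatrices: copy the block A[rows, cols]
# into the front of a preallocated buffer, laid out column-major.
function pack_block!(buf::Vector{T}, A::AbstractMatrix{T},
                     rows::UnitRange{Int}, cols::UnitRange{Int}) where {T}
    m = length(rows)
    n = length(cols)
    @assert length(buf) >= m * n
    idx = 0
    for j in cols        # columns outermost, so consecutive buffer entries walk down a column
        for i in rows
            idx += 1
            buf[idx] = A[i, j]
        end
    end
    # View the filled part of the buffer as an m×n column-major matrix.
    return reshape(view(buf, 1:idx), m, n)
end

A = reshape(collect(1.0:24.0), 4, 6)
At = transpose(A)                    # 6×4 and row-major in memory
buf = Vector{Float64}(undef, 16)
P = pack_block!(buf, At, 2:4, 1:3)   # contiguous, column-major copy of a 3×3 block
```

After the call, `P` holds the same values as `At[2:4, 1:3]`, but its columns are contiguous in `buf` regardless of how the source was laid out.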
PaddedMatrices.matmul!
— Method
matmul!(C, A, B[, α, β, max_threads])
Calculates C = α * A * B + β * C in place, overwriting the contents of C. It may use up to max_threads threads. It will not use threads when nested in other threaded code.
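The update rule can be checked against a reference computation. The sketch below uses the five-argument `mul!` from Julia's standard `LinearAlgebra` library as a stand-in, since it applies the same C = α * A * B + β * C update; it does not call PaddedMatrices itself, and the sample values are arbitrary.

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
B = [5.0 6.0; 7.0 8.0]
C = ones(2, 2)
α, β = 2.0, 3.0

# Reference result, computed out of place before C is overwritten.
expected = α * (A * B) + β * C

# In-place update: C is overwritten, A and B are left untouched.
mul!(C, A, B, α, β)
```

Assuming the signature in the docstring above, `matmul!(C, A, B, α, β)` would be expected to leave the same values in `C`.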
PaddedMatrices.matmul
— Method
matmul(A, B)
Multiply matrices A and B.
PaddedMatrices.matmul_serial!
— Method
matmul_serial!(C, A, B[, α = 1, β = 0])
Calculates C = α * (A * B) + β * C in place.
A single-threaded matrix-matrix multiply implementation. Supports dynamically and statically sized arrays.
Organizationally, matmul_serial! checks the arrays' properties to try to dispatch to an appropriate implementation. If the arrays are small and statically sized, it will dispatch to an inlined multiply.
Otherwise, based on the arrays' sizes, whether they are transposed, and whether the columns are already aligned, it decides to not pack at all, to pack only A, or to pack both arrays A and B.
PaddedMatrices.reseet_bcache_lock!
— Method
reset_bcache_lock!()
The matmul routines currently do not use try/finally, despite locking. So if one errors for some reason, you may need to manually call reset_bcache_lock!().
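To see why a manual reset can be needed, consider this minimal sketch of a lock that is released without try/finally. All names here (`demo_locked_call`, `demo_reset!`, `flag`) are hypothetical stand-ins built on `Threads.Atomic`; this is not the package's actual lock implementation.

```julia
# Hypothetical stand-ins; none of these names come from PaddedMatrices.
function demo_locked_call(f, flag::Threads.Atomic{Bool})
    # atomic_cas! returns the old value: keep spinning while someone else holds the flag.
    while Threads.atomic_cas!(flag, false, true)
        yield()
    end
    result = f()       # no try/finally: if f throws, the release below is skipped
    flag[] = false
    return result
end

demo_reset!(flag::Threads.Atomic{Bool}) = (flag[] = false; nothing)

flag = Threads.Atomic{Bool}(false)

ok = demo_locked_call(() -> 42, flag)
held_after_success = flag[]           # released normally

failed = try
    demo_locked_call(() -> error("simulated failure"), flag)
    false
catch
    true
end
held_after_error = flag[]             # still held, because the release was skipped

demo_reset!(flag)                     # manual recovery, analogous to the reset described above
held_after_reset = flag[]
```

Without the manual reset, the next call to `demo_locked_call` would spin forever waiting for a flag that nothing will release.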
PaddedMatrices.@gc_preserve
— Macro
@gc_preserve foo(A, B, C)
Apply to a single, non-nested function call. It will GC.@preserve all the arguments and substitute suitable arrays with PtrArrays. This has the benefit of potentially allowing statically sized mutable arrays to be both stack allocated and passed through a non-inlined function boundary.
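For context, the manual pattern that the macro automates looks roughly like the sketch below. It uses only Base's `GC.@preserve`, `pointer`, and `unsafe_load`; the function name `sum_via_pointer` is made up for illustration, and the code does not reproduce the macro's expansion or the PtrArray type.

```julia
# Illustrative only: keep `x` rooted while its memory is accessed through a raw pointer.
function sum_via_pointer(x::Vector{Float64})
    s = 0.0
    GC.@preserve x begin
        p = pointer(x)                 # only valid while x is preserved
        for i in 1:length(x)
            s += unsafe_load(p, i)     # 1-based element offset
        end
    end
    return s
end

total = sum_via_pointer([1.0, 2.0, 3.0])
```

The pointer must not escape the `GC.@preserve` block, which is why the macro is restricted to wrapping a single function call.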