A `transform` pass to lower `vector.contract` to (a) `vector.fma` for
`F32`, (b) `x86vector.avx512.dot` for `BF16`, (c) `x86vector.avx.dot.i8`
for `Int8` packed types.
The lowering works on condition with `m`, `batch`, `k` dims to be `one`
and `vnni` dim should be `2` for `bf16`; `4` for `int8`.
**The lowering pattern**: `batch_reduce.matmul` (input) ->
register-tiling(M, N) -> Vectorization (to `vector.contract`) ->
`unroll` vector.contract (`unit` dims) -> `hoisting` transformation
(move `C` loads/store outside batch/k loop) -> apply `licm`,
`canonicalization`, and `bufferize`.