The current lowering pattern for create_nd_tdesc restricts the source memref
to a static shape.
For a dynamically shaped ranked memref, create_nd_tdesc already takes the
shape as an argument.
The lowering can use those values instead of returning a mismatch error.
This PR adds lowering of xegpu.load_matrix/store_matrix to
xevm.blockload/blockstore or llvm.load/store, depending on wi-level
attributes.
It includes a few components:
1. adds wi-level attributes: subgroup_block_io.
2. expands the load_matrix/store_matrix op definitions to support scalar data
(besides vector data).
3. adds a member function to mem_desc to compute the linearized address
for n-D offsets.
4. adds lowering depending on wi-level attributes:
a) if the subgroup_block_io attribute is present, lower to
xevm.blockload/blockstore
b) else lower to llvm.load/store. If the result is a vector, lower to
llvm.load/store with a vector operand.
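As a rough illustration of what linearizing n-D offsets means, here is a minimal sketch that assumes a plain strided, row-major layout. The helper names `rowMajorStrides` and `linearizeOffsets` are hypothetical and are not the mem_desc member function added by the PR, which may also have to account for layout details not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: row-major strides for a static shape,
// e.g. shape {2, 3, 4} gives strides {12, 4, 1}.
std::vector<int64_t> rowMajorStrides(const std::vector<int64_t> &shape) {
  std::vector<int64_t> strides(shape.size(), 1);
  // Walk from the innermost dimension outwards.
  for (std::size_t i = shape.size(); i > 1; --i)
    strides[i - 2] = strides[i - 1] * shape[i - 1];
  return strides;
}

// Hypothetical helper: linear element offset = sum(offsets[i] * strides[i]).
int64_t linearizeOffsets(const std::vector<int64_t> &offsets,
                         const std::vector<int64_t> &strides) {
  assert(offsets.size() == strides.size() && "rank mismatch");
  int64_t linear = 0;
  for (std::size_t i = 0; i < offsets.size(); ++i)
    linear += offsets[i] * strides[i];
  return linear;
}
```

Under these assumptions, element (1, 2) of a 4x8 matrix sits at linear offset 1 * 8 + 2 * 1 = 10; a byte address would additionally scale this by the element size.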
Use wrappers around `std::accumulate` to make the code more concise and
less bug-prone: https://github.com/llvm/llvm-project/pull/162129.
With `std::accumulate`, it's the initial value that determines the
accumulator type. `llvm::sum_of` and `llvm::product_of` pick the right
accumulator type based on the range element type.
Found some funny bugs, like a local accumulate helper that calculated a
sum with an initial value of 1 -- we didn't hit the bug because the code
was actually dead...
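To make the accumulator-type pitfall concrete, here is a small self-contained sketch. `sumOf` and `productOf` are hypothetical stand-ins written for this example, not the LLVM implementations of `llvm::sum_of` and `llvm::product_of`; they only illustrate the idea of deriving the accumulator type from the range element type.

```cpp
#include <cstdint>
#include <functional>
#include <iterator>
#include <numeric>
#include <type_traits>
#include <vector>

// Pitfall: the literal 0 is an int, so every partial sum is converted
// back to int and the fractional part is lost at each step.
inline auto truncatedSum(const std::vector<double> &values) {
  return std::accumulate(values.begin(), values.end(), 0);
}

// Hypothetical stand-in: the accumulator type follows the element type.
template <typename Range>
auto sumOf(const Range &range) {
  using T = std::decay_t<decltype(*std::begin(range))>;
  return std::accumulate(std::begin(range), std::end(range),
                         static_cast<T>(0));
}

// Hypothetical stand-in: same idea for products, with identity 1.
template <typename Range>
auto productOf(const Range &range) {
  using T = std::decay_t<decltype(*std::begin(range))>;
  return std::accumulate(std::begin(range), std::end(range),
                         static_cast<T>(1), std::multiplies<T>{});
}

// The raw call returns int; the helpers return the element type.
static_assert(std::is_same_v<
              decltype(truncatedSum(std::vector<double>{})), int>);
static_assert(std::is_same_v<
              decltype(productOf(std::vector<int64_t>{})), int64_t>);
```

With the values {0.5, 0.5, 0.5}, `truncatedSum` yields 0 because each partial sum is truncated to int, while `sumOf` yields 1.5. Keeping the identity element inside the helper also rules out the "sum starting at 1" kind of mistake at the call site.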
Fixes two issues with the XeGPU to XeVM pass:
1. xegpu.update_nd_offset op lowering generated an incorrect code sequence
2. xegpu.store_nd did not lower single-element vectors