[MLIR][XeGPU][VectorToXeGPU] Lower vector.load/store/transfer_read/transfer_write to new offsets syntax (#162095)
Changes the `VectorToXeGPU` pass to generate `xegpu.load_nd/store_nd`
ops using the new syntax, where offsets are specified on the load/store
ops rather than on `xegpu.create_nd_tdesc`.
```mlir
// from this
%desc = xegpu.create_nd_tdesc %src[%off1, %off2]: memref<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
%res = xegpu.load_nd %desc : !xegpu.tensor_desc<8x16xf16> -> vector<8x16xf16>
// to this
%desc = xegpu.create_nd_tdesc %src: memref<8x16xf16> -> !xegpu.tensor_desc<8x16xf16>
%res = xegpu.load_nd %desc[%off1, %off2] : !xegpu.tensor_desc<8x16xf16> -> vector<8x16xf16>
```
To support cases with dimension reduction at the
`create_nd_tdesc` level (e.g. `memref<8x8x16xf16> ->
tensor_desc<8x16xf16>`), the pass now inserts a `memref.subview` that
collapses the source shape to 2D, for example:
```mlir
// input:
%0 = vector.load %source[%off0, %off1, %off2] : memref<8x16x32xf32>, vector<8x16xf32>
// --vector-to-xegpu (old)
%tdesc = xegpu.create_nd_tdesc %source[%off0, %off1, %off2] : memref<8x16x32xf32> -> tdesc<8x16xf32>
%vec = xegpu.load_nd %tdesc
// --vector-to-xegpu (new)
%collapsed = memref.subview %source[%off0, 0, 0] [1, 16, 32] [1, 1, 1] :
memref<8x16x32xf32> -> memref<16x32xf32, strided<[32, 1], offset: ?>>
%tdesc = xegpu.create_nd_tdesc %collapsed : memref<16x32xf32, ...> -> tdesc<8x16xf32>
%vec = xegpu.load_nd %tdesc[%off1, %off2]
```
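The store path can be sketched the same way. The snippet below is an illustration by analogy with the load example, not output copied from the pass: the `vector.store` input is hypothetical, types are abbreviated as above, and the real lowering may attach extra attributes. The descriptor tile shape is assumed to match the stored vector.
```mlir
// input (hypothetical store counterpart of the load example):
vector.store %vec, %source[%off0, %off1, %off2] : memref<8x16x32xf32>, vector<8x16xf32>
// --vector-to-xegpu (new), assumed shape of the result:
// the leading offset goes into the rank-reducing subview ...
%collapsed = memref.subview %source[%off0, 0, 0] [1, 16, 32] [1, 1, 1] :
memref<8x16x32xf32> -> memref<16x32xf32, strided<[32, 1], offset: ?>>
%tdesc = xegpu.create_nd_tdesc %collapsed : memref<16x32xf32, ...> -> tdesc<8x16xf32>
// ... and the two trailing offsets go on store_nd
xegpu.store_nd %vec, %tdesc[%off1, %off2]
```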
<details><summary>Why do we need to change that?</summary>
```mlir
// without the subview: reduce the rank at create_nd_tdesc and apply all 3 offsets at load_nd
%desc = xegpu.create_nd_tdesc %source : memref<8x16x32xf32> -> !xegpu.tensor_desc<16x32xf32>
// error: xegpu.load_nd requires len(offsets) == desc.rank (here 3 != 2)
%res = xegpu.load_nd %desc[%off0, %off1, %off2] : !xegpu.tensor_desc<16x32xf32> -> vector<8x16xf32>
```
</details>
---------
Signed-off-by: dchigarev <dmitry.chigarev@intel.com>