This PR extends the XeGPU lane layout to support wrap-around
distribution: a lane-level tensor tile is replicated across all lanes
when the tile size equals lane_data along a given dimension. Previously,
distribution required the tile size to be a multiple of the number of
lanes × lane_data so that it could be partitioned evenly.
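
To make the two cases concrete, here is a minimal one-dimensional sketch
of the rule described above. It is not the code in this PR:
distributedDimSize() and its parameters are hypothetical names, and the
divisibility condition used for the even case is an assumption about
what "even partitioning" means.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper (not part of XeGPU): per-lane size of one dimension.
// Returns std::nullopt when neither distribution rule applies.
std::optional<int64_t> distributedDimSize(int64_t dimSize, int64_t numLanes,
                                          int64_t laneData) {
  assert(dimSize > 0 && numLanes > 0 && laneData > 0);
  const int64_t span = numLanes * laneData;
  // Even distribution (assumed rule): the dimension splits evenly across
  // the lanes in chunks of laneData elements.
  if (dimSize % span == 0)
    return dimSize / numLanes;
  // Wrap-around distribution: the tile is exactly one laneData chunk, so
  // every lane holds a replica of it.
  if (dimSize == laneData)
    return laneData;
  return std::nullopt;
}
```

Under these assumptions, a dimension of size 16 over 16 lanes with
lane_data 1 gives each lane one element, a dimension of size 1 with the
same layout is replicated to every lane, and a dimension of size 8 is
rejected.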

This PR also refactors the layout attribute interface functions:

- computeDistributedShape() computes the distributed vector shape. It is
  shared by workgroup-to-subgroup and subgroup-to-lane distribution,
  which follow the same distribution rule (even or wrap-around).
- computeStaticDistributedCoords() computes the compile-time distributed
  coordinates of the sub-tiles owned by each subgroup or lane. It is the
  compile-time counterpart of computeDistributedCoords() and is used by
  isCompatibleWith().
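
As a rough illustration of what compile-time sub-tile coordinates can
look like, the sketch below computes the start offsets along a single
dimension for one subgroup or lane. It is not the implementation in this
PR: staticSubTileOffsets() is a hypothetical name, and the round-robin
assignment with a modulo for the wrap-around case is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of XeGPU): start offsets, along one
// dimension, of the sub-tiles owned by unit `unitId` (a subgroup or lane).
std::vector<int64_t> staticSubTileOffsets(int64_t dimSize, int64_t numUnits,
                                          int64_t unitData, int64_t unitId) {
  assert(dimSize > 0 && numUnits > 0 && unitData > 0);
  assert(unitId >= 0 && unitId < numUnits);
  const int64_t stride = numUnits * unitData;
  // Only the even and wrap-around cases are handled in this sketch.
  assert(dimSize % stride == 0 || dimSize == unitData);
  // Even case: one sub-tile per full round over all units.
  // Wrap-around case: a single sub-tile.
  const int64_t rounds = dimSize >= stride ? dimSize / stride : 1;
  std::vector<int64_t> offsets;
  for (int64_t r = 0; r < rounds; ++r) {
    // The modulo folds every unit onto offset 0 in the wrap-around case.
    offsets.push_back((unitId * unitData + r * stride) % dimSize);
  }
  return offsets;
}
```

With these assumptions, unit 1 of 4 with unit data 2 over a dimension of
size 32 owns the sub-tiles starting at 2, 10, 18 and 26, while in the
wrap-around case (dimension size 2) every unit owns the sub-tile at
offset 0. A compatibility check could compare such offset lists between
two layouts.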