This PR adds support for lowering of `vector.reduction` and
`vector.multi_reduction` ops in subgroup to work-item distribution.
Following cases are considered currently (more support will be added
later):
* `vector.reduction` : This assumes the source vector is distributed to
all lanes and lanes must shuffle data to do a collaborative reduction.
result is shared among all lanes. This is done by emitting
`gpu::ShuffleOp` s and doing a butterfly reduction. Refer
`VectorDistribution` for more details.
* `vector.multi_reduction`: 2 cases are considered,
1. **Reduction is lane-local**: simply lower to a lane local multi
reduction op. each lane does its own reduction. result is distributed.
2. **Reduction is not lane-local:** This one is handled indirectly. In
this case, we rewrite the reduction in terms of `vector.reduction` ops
(plus exrtact. insert) before the WI distribution even begin. Then whole
things is distributed using `gpu::ShuffleOp` s later (not fullly
supported yet).