Jeff Sandoval 95a76886c1
[OpenMP][MLIR] Fix GPU teams reduction buffer size for by-ref reductions (#185460)
The `ReductionDataSize` field in `KernelEnvironmentTy` and the
`MaxDataSize` used to compute the `reduce_data_size` argument to
`__kmpc_nvptx_teams_reduce_nowait_v2` were both computed using pointer
types for by-ref reductions instead of the actual element types. This
caused the global teams reduction buffer to be undersized relative to
the offsets used by the copy/reduce callbacks, resulting in
out-of-bounds access faults at runtime.

For example, a by-ref reduction over `[4 x i32]` (16 bytes) would
allocate buffer slots based on `sizeof(ptr)` = 8 bytes, but the
generated callbacks would access 16 bytes per slot.
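
The mismatch can be illustrated with a minimal standalone sketch. It is
not the compiler code: `Element` and the helper functions are
hypothetical names, and the 8-byte pointer size assumes a 64-bit target.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the by-ref reduction element `[4 x i32]`.
using Element = std::int32_t[4];

// Slot size derived from the pointer type (the buggy computation).
constexpr std::size_t slotSizeFromPointer() { return sizeof(Element *); }

// Slot size derived from the element type (the fixed computation).
constexpr std::size_t slotSizeFromElement() { return sizeof(Element); }

// Bytes by which a callback touching a full element overruns a
// pointer-sized slot; zero if the slot is already large enough.
constexpr std::size_t overrunBytes() {
  return slotSizeFromElement() > slotSizeFromPointer()
             ? slotSizeFromElement() - slotSizeFromPointer()
             : 0;
}
```

On a target with 8-byte pointers, `overrunBytes()` is 8: each slot is
half the size the callbacks assume.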

Fix both computation sites:

1. In MLIR's `getReductionDataSize()`, use
`DeclareReductionOp::getByrefElementType()` instead of `getType()` when
the reduction is by-ref, so the reduction buffer struct layout (and more
importantly its size) matches that emitted by the `OMPIRBuilder`.

2. In `OMPIRBuilder::createReductionsGPU()`, use
`ReductionInfo::ByRefElementType` instead of `ElementType` for by-ref
reductions when computing `MaxDataSize`. `MaxDataSize` does not appear
to be used by the DeviceRTL, but fixing it avoids propagating this bug
in the future.
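
The selection rule shared by both fixes can be sketched as follows. This
is a simplified model under stated assumptions, not the actual MLIR or
`OMPIRBuilder` code: `ReductionModel`, `reducedSize`, and `maxDataSize`
are hypothetical names, and types are represented only by their sizes
in bytes.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model of one reduction; not LLVM's
// `ReductionInfo`. Types are represented only by their byte sizes.
struct ReductionModel {
  bool isByRef;
  std::size_t elementTypeSize;      // pointer size when the reduction is by-ref
  std::size_t byRefElementTypeSize; // pointee size; meaningful only when by-ref
};

// Size of the data the callbacks actually copy for this reduction.
inline std::size_t reducedSize(const ReductionModel &r) {
  return r.isByRef ? r.byRefElementTypeSize : r.elementTypeSize;
}

// Largest per-reduction size, analogous in spirit to `MaxDataSize`.
inline std::size_t maxDataSize(const std::vector<ReductionModel> &reductions) {
  std::size_t maxSize = 0;
  for (const ReductionModel &r : reductions)
    maxSize = std::max(maxSize, reducedSize(r));
  return maxSize;
}
```

For a by-ref `[4 x i32]` reduction alongside a by-value `i32` one, this
model yields 16 rather than the 8 obtained from the pointer size.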

Finally, add CHECK lines to the existing array-descriptor reduction test
to verify both the kernel environment `ReductionDataSize` and the
`reduce_data_size` call argument reflect the actual element type size.

Assisted-by: Claude Opus 4.6

---------

Co-authored-by: Jeffrey Sandoval <jeffrey.sandoval@hpe.com>
2026-04-01 14:59:16 -05:00