Adel Ejjeh 49250284cf
[AMDGPU][LoopUnroll] Enable allowexpensivetripcount for amdgpu when unroll pragmas are present (#181241)
This PR is intended as an AMDGPU-specific solution for #181267 while
discussions on changing the default behavior for all targets continue in
that PR.

Problem:
Loops with an explicit unroll pragma (#pragma unroll / #pragma clang
loop unroll(enable)) that have an expensive runtime trip count currently
don't get unrolled because UP.AllowExpensiveTripCount defaults to false.
The pragma is silently ignored. This is not the case when an unroll
factor is specified (PragmaCount > 0), where the pass sets
UP.AllowExpensiveTripCount = true.
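
For illustration, a loop of the affected kind might look like the sketch below. The function name and the strided access pattern are hypothetical and not taken from this PR. It assumes the trip count counts as expensive because computing it (roughly ceil(n / stride)) needs a runtime division by a value that is not a compile-time constant.

```cpp
#include <cstddef>

// Hypothetical example, not from the PR: sums every `stride`-th element of
// `data` in [0, n). The caller must pass stride > 0.
// The loop carries an unroll pragma without a count, and its trip count can
// only be computed at run time with a division by `stride`.
float strided_sum(const float *data, std::size_t n, std::size_t stride) {
  float acc = 0.0f;
#pragma unroll
  for (std::size_t i = 0; i < n; i += stride)
    acc += data[i];
  return acc;
}
```

Whether a given loop is actually classified as having an expensive trip count depends on the unroller's cost analysis, so treat this as a plausible shape rather than a guaranteed reproducer.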

Solution:
Set UP.AllowExpensiveTripCount to true in the AMDGPU TTI implementation
for loops that have an unroll pragma.
I've added a new lit test, expensive-tripcount.ll, that verifies that
pragma-driven unrolling works as expected for loops with expensive trip
counts.
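
The decision described above can be modeled in isolation as follows. This is a simplified sketch and not the actual LLVM code: UnrollPrefsModel, LoopHintsModel, and applyPragmaPolicy are hypothetical stand-ins, and only the field name AllowExpensiveTripCount and the PragmaCount condition come from the description.

```cpp
// Hypothetical stand-in for the one unrolling preference discussed here.
struct UnrollPrefsModel {
  bool AllowExpensiveTripCount = false; // default, as described in the problem
};

// Hypothetical summary of the unroll hints attached to a loop.
struct LoopHintsModel {
  bool HasUnrollEnablePragma = false; // pragma present without a count
  unsigned PragmaCount = 0;           // explicit unroll factor, 0 if none
};

// Models the intended policy: any explicit unroll pragma opts the loop in to
// expensive trip count computation. Loops without a pragma are left alone.
void applyPragmaPolicy(const LoopHintsModel &Hints, UnrollPrefsModel &UP) {
  if (Hints.HasUnrollEnablePragma || Hints.PragmaCount > 0)
    UP.AllowExpensiveTripCount = true;
}
```

In this model the target hook handles the pragma-without-count case, while the PragmaCount > 0 case mirrors what the pass already does.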

The change showed no meaningful regressions across a few different
workloads from Composable Kernel (CK) and llama.cpp, as well as PyTorch
kernels, on AMDGPU gfx950. Additionally, the change improves the
performance of PyTorch reduction loops on AMDGPU targets.
2026-03-13 09:56:20 -07:00