Fix bitcast test, which was splitting apart phis intended to force
bitcasts that survive all the way to selection.
Disable the amdgpu-codegenprepare phi splitting, which defeats the technique
of using a phi to ensure a bitcast reaches all the way to selection. Also
add a variety of bfloat tests. These probably need revisiting to avoid the
cast folding into argument loads. Also round out the set of bfloat bitcast and
ABI tests.
Add codegen tests for more bf16 operations. The promotion of these works
contrary to the comment.
It is currently disabled by default. It will need experiments on real
hardware to tune it and decide on its profitability.
---------
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
This will result in larger atomic operations getting expanded to
`__atomic_*` libcalls via AtomicExpandPass, which matches what Clang
already does in the frontend.
While AMDGPU currently disables the use of all libcalls, I've changed it
to instead disable all of them _except_ the atomic ones. Those are
already emitted by the Clang frontend, and enabling them in the
backend allows the same behavior there.
We were only folding cases which remained extloads, but DAG.getExtLoad can also handle the cases which don't need to extend at all (we just can't do truncloads).
reduceLoadWidth can handle this for scalar loads, but not for vectors.
Noticed while triaging D152928
Physical VGPRs used for SGPR spills need to be tracked independently of
WWM reserved registers. The WWM reserved set contains extra registers
allocated during the WWM pre-allocation pass.
This causes SGPR spills allocated after WWM pre-allocation to overlap
with WWM register usage, e.g. if the frame pointer is spilled during
prologue/epilogue insertion.
This is an experimental address space for strided buffers. These buffers
can have structs as elements and a stride > 1.
These pointers allow indexed access in units of stride, i.e., they
point at `buffer[index * stride]`.
Thus, we can use the `idxen` modifier for buffer loads.
We assign address space 9 to 192-bit buffer pointers which contain a
128-bit descriptor, a 32-bit offset and a 32-bit index. Essentially,
they are fat buffer pointers with an additional 32-bit index.
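As a rough illustration of the layout described above, the following C++ sketch models the pointer as a plain struct. The struct name, field names, and field order are hypothetical; only the field widths come from the description. Treating the offset as a byte offset added on top of `index * stride` is also an assumption, and the stride is passed explicitly to keep the sketch simple.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a 192-bit strided buffer pointer (address space 9).
// Only the field widths come from the description; names and order are assumed.
struct StridedBufferPtr {
  uint32_t descriptor[4]; // 128-bit buffer descriptor
  uint32_t offset;        // 32-bit offset (assumed: bytes added to the element start)
  uint32_t index;         // 32-bit index, counted in units of stride
};

static_assert(sizeof(StridedBufferPtr) * 8 == 192, "pointer is 192 bits wide");

// Byte position relative to the buffer base: index * stride + offset.
inline uint64_t byteOffset(const StridedBufferPtr &p, uint32_t stride) {
  assert(stride > 0 && "stride must be non-zero");
  return uint64_t(p.index) * stride + p.offset;
}
```

For example, with a stride of 16 bytes, index 3 and offset 4, the sketch addresses byte 52 of the buffer.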
Add empty AMDGPUGlobalISelDivergenceLowering pass. This pass will
implement:
- selection of divergent i1 phis as lane mask phis, requires lane mask
  merging in some cases
- lower uses of divergent i1 values outside of the cycle using lane mask
  merging
- lowering of all cases of temporal divergence:
  - lower uses of uniform i1 values outside of the cycle using lane mask
    merging
  - lower uses of uniform non-i1 values outside of the cycle using a copy
    to vgpr inside of the cycle
Add a very detailed set of regression tests for the cases mentioned above.
patch 1 from: https://github.com/llvm/llvm-project/pull/73337
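The lane mask merging referred to above can be pictured with a small C++ sketch. This is not taken from the patch: the and-not/and/or formula, the 64-lane width, and all names here are assumptions chosen for illustration. The idea is that lanes active in the current iteration take the newly computed bit, while inactive lanes keep the bit accumulated so far.

```cpp
#include <cassert>
#include <cstdint>

// One bit per lane; 64 lanes are assumed here (wave64).
using LaneMask = uint64_t;

// Merge the value computed in this iteration into the accumulated mask:
// lanes set in exec take their bit from cur, all other lanes keep prev.
inline LaneMask mergeLaneMasks(LaneMask prev, LaneMask cur, LaneMask exec) {
  return (prev & ~exec) | (cur & exec);
}

// Read the bit belonging to a single lane.
inline bool laneBit(LaneMask mask, unsigned lane) {
  assert(lane < 64 && "lane index out of range");
  return (mask >> lane) & 1;
}
```

With `prev = 0x5`, `cur = 0xA` and `exec = 0x3`, lanes 0 and 1 take their bits from `cur` and lanes 2 and 3 keep `prev`, giving `0x6`.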
These tests rely on SCEV recognizing an "or" with no common
bits as an "add". Add the disjoint flag to relevant or instructions
in preparation for switching SCEV to use the flag instead of the
ValueTracking query. The IR with disjoint flag matches what
InstCombine would produce.
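The equivalence these tests depend on can be checked with a small C++ sketch. The function names are hypothetical and only model the arithmetic, not SCEV or the IR flag itself: when two operands share no set bits, adding them produces no carries, so the "or" and the "add" give the same result.

```cpp
#include <cassert>
#include <cstdint>

// True when a and b share no set bits, i.e. an "or" of them would be disjoint.
inline bool haveNoCommonBits(uint32_t a, uint32_t b) { return (a & b) == 0; }

// With no common bits there are no carries, so a | b equals a + b.
inline uint32_t disjointOr(uint32_t a, uint32_t b) {
  assert(haveNoCommonBits(a, b) && "operands must not overlap");
  return a | b;
}
```

A typical case is an index such as `(i << 2) | 1`: the low two bits of `i << 2` are zero, so the "or" behaves exactly like `(i << 2) + 1`. When the operands do overlap, as with `3 | 1`, the two operations differ.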