Return poison instead of undef for non-demanded lanes in the AMDGPU
demanded element simplification hook.
Also bail out of dmask is 0, as this case has special semantics:
> If DMASK==0, the TA overrides DMASK=1 and puts zeros in VGPR followed by
> LWE status if exists. TFE status is not generated since the fetch is dropped.
It is currently disabled by default. It will need experiments on a real
HW to tune and decide on the profitability.
---------
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
This will result in larger atomic operations getting expanded to
`__atomic_*` libcalls via AtomicExpandPass, which matches what Clang
already does in the frontend.
While AMDGPU currently disables the use of all libcalls, I've changed it
to instead disable all of them _except_ the atomic ones. Those are
already be emitted by the Clang frontend, and enabling them in the
backend allows the same behavior there.
1) It was marked as volatile. This is not needed and the only reason
it was done is because it is both load and store and handled
together with atomics. Global load to LDS was marked as volatile
just because buffer load was done that way.
2) Preserve at least LDS (store) pointer which we always have with
the intrinsics.
3) Use PoisonValue instead of nullptr for load memop as a Value.
SIInstrInfo::areMemAccessesTriviallyDisjoint does a DS offset checks,
but does not account for LDS DMA instructions. Added these checks.
Without it code falls through and returns true which is wrong. As a
result mayAlias would always return false for LDS DMA and a regular LDS
instruction or 2 LDS DMA instructions.
At the moment this is NFCI because we do not use this AA in a context
which may touch LDS DMA instructions. This is also unreacheable now
because of the ordered memory ref checks just above in the function and
LDS DMA is marked as volatile. This volatile marking is removed in PR
#75247, therefore I'd submit this check before #75247.
This means that getRegInterval no longer depends on the MCInstrDesc, so
it could be simplified further to take just a MachineOperand or just a
physical register. NFCI.
This was caught by coverity, reported as: `dead_error_condition`.
Since the conditional revolves around `CF`, it is guaranteed to be null
in the else clause, hence making the second part of the statement
redundant.
Physical VGPRs used for SGPR spills need to be tracked independent of
WWM reserved registers. The WWM reserved set contains extra registers
allocated during WWM pre-allocation pass.
This causes SGPR spills allocated after WWM pre-allocation to overlap
with WWM register usage, e.g. if frame pointer is spilt during
prologue/epilog insertion.
This is an experimental address space for strided buffers. These buffers
can have structs as elements and
a stride > 1.
These pointers allow the indexed access in units of stride, i.e., they
point at `buffer[index * stride]`.
Thus, we can use the `idxen` modifier for buffer loads.
We assign address space 9 to 192-bit buffer pointers which contain a
128-bit descriptor, a 32-bit offset and a 32-bit index. Essentially,
they are fat buffer pointers with an additional 32-bit index.
Make abstract class PhiLoweringHelper and expose it for use in
GlobalISel path.
SILowerI1Copies implements PhiLoweringHelper as Vreg1LoweringHelper and
it is equivalent to SILowerI1Copies.
Notable change that createLaneMaskReg now clones attributes from
register that has lane mask attributes instead of creating register with
lane mask register class. This is because lane masks have
different(more) attributes in GlobalISel.
patch 2 from: https://github.com/llvm/llvm-project/pull/73337
No GFX12 encoding was added for these. This patch adds tests that they
are not recognized by the assembler and defends against generating them
in codegen.
Add empty AMDGPUGlobalISelDivergenceLowering pass. This pass will
implement
- selection of divergent i1 phis as lane mask phis, requires lane mask
merging in some cases
- lower uses of divergent i1 values outside of the cycle using lane mask
merging
- lowering of all cases of temporal divergence:
- lower uses of uniform i1 values outside of the cycle using lane mask
merging
- lower uses of uniform non-i1 values outside of the cycle using a copy
to vgpr inside of the cycle
Add very detailed set of regression tests for cases mentioned above.
patch 1 from: https://github.com/llvm/llvm-project/pull/73337
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.
I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
The user can provide multiple predicators to MacroFusion and the
DAG mutation will be applied if one of them is evalated to true.
`ShouldSchedulePredTy` is renamed to `MacroFusionPredTy`.
The user can provide multiple predicators to MacroFusion and the
DAG mutation will be applied if one of them is evalated to true.
`ShouldSchedulePredTy` is renamed to `MacroFusionPredTy`.