This allows more accurate alias analysis to apply at the bundle level.
This has a bunch of minor effects in post-RA scheduling that look mostly
beneficial to me, all of them in AMDGPU (the Thumb2 change is cosmetic).
The pre-existing (and unchanged) test in
CodeGen/MIR/AMDGPU/custom-pseudo-source-values.ll tests that MIR with a
bundle with MMOs can be parsed successfully.
v2:
- use cloneMergedMemRefs
- add another test to explicitly check the MMO bundling behavior
v3:
- use poison instead of undef to initialize the global variable in the
test
(#131759)
This isn't really the right check, we want to know that the intrinsic
does not perform a true function call to any code (in the module or
not). nocallback
appears to be the closest thing to this property we have now though.
Fixes theoretically
miscompiles with intrinsics like statepoint, which hide a call to a real
function.
Also do the same for inferring no-agpr usage.
Currently MachineVerifier is missing verifying early-clobber operand
constraint.
The only other machine operand constraint - TiedTo is already verified.
This is in preparation of a future AMDGPU change where we are going to
create bundles before register allocation and want to rely on the
TwoAddressInstructionPass handling those bundles correctly.
v2:
- simplify the virtual register check and the test
The WMMA MI(s) are missing the isConvergent flag. This causes incorrect
behavior in passes like machine-sink, where WMMA instructions get sunk
into divergent branches.
This patch fixes the issue by setting the isConvergent flag to 1 in the
VOP3PInstructions.td file.
Also add a corresponding intrinsic property that can be used to mark
intrinsics that do not introduce poison, for example simple arithmetic
intrinsics that propagate poison just like a simple arithmetic
instruction.
As a smoke test this patch adds the new property to
llvm.amdgcn.fmul.legacy.
Since #163011 changed AMDGPU to use ELF mangling, the regexp failed to
match private functions because of the inconsistent presence/absence of
the .L prefix on the first line of the function e.g.:
```
.Lfoo: ; @foo
```
In case there is an dynamic alloca / an alloca which is not in the entry
block, cs.chain functions do not set up an FP, but are reported to need
one. This results in a failed assertion in
`SIFrameLowering::emitPrologue()` (Assertion `(!HasFP || FPSaved) &&
"Needed to save FP but didn't save it anywhere"' failed.) This commit
changes `hasFPImpl` so that the need for an SP in a cs.chain function
does not directly imply the need for an FP anymore.
This LLVM defect was identified via the AMD Fuzzing project.
By convention a basic block shall start with MSBs zero. We also
need to know a previous mode in all cases as SWDEV-562450 asks
to record the old mode in the high bits of the mode.
This patch adds register bank legalization support for G_FADD opcodes in
the AMDGPU GlobalISel pipeline.
Added new reg bank type UniInVgprS64.
This patch also adds a combine logic for ReadAnyLane + Trunc + AnyExt.
---------
Co-authored-by: Abhinav Garg <abhigarg@amd.com>
We do not have native instructions for direct bfloat comparisons.
However, we can expand bfloat to float, and do float comparison instead.
TODO: handle bfloat comparison for ballot intrinsic on global isel path.
Fixes: SWDEV-563403
When selecting for G_AMDGPU_COPY_SCC_VCC, we use S_CMP_LG_U64 or
S_CMP_LG_U32 for wave64 and wave32 respectively. However, on gfx7 we do
not have the S_CMP_LG_U64 instruction. Work around this issue by using
S_OR_B64 instead.
Unused loop invariant loads were not sunk from the preheader to the exit
block, increasing live range.
This commit moves the sinkUnusedInvariant logic from indvarsimplify to
LICM also adds functionality to sink unused load that's not
clobbered by the loop body.
This PR enables AMDGPUUniformIntrinsicCombine pass in the llc pipeline.
Also introduces the "amdgpu-uniform-intrinsic-combine" command-line flag
to enable/disable the pass.
see the PR:https://github.com/llvm/llvm-project/pull/116953
There has been an issue(using function analysis inside the module pass
in OPM) integrating this pass into the LLC pipeline, which currently
lacks NPM support. I tried finding a way to get the per-function
analysis, but it seems that in OPM, we don't have that option.
So the best approach would be to make it a function pass.
Ref: https://github.com/llvm/llvm-project/pull/116953
Add support for no-return variants of image atomic operations
(e.g. IMAGE_ATOMIC_ADD_NORTN, IMAGE_ATOMIC_CMPSWAP_NORTN).
These variants are generated when the return value of the intrinsic is
unused, allowing the backend to select no return type instructions.
This helps clean up some more legalization artefacts during
legalization, in a similar way to other operations, and helps some of
the DUP cases get through legalization successfully.
Apply additional counter waits to address VALU writes to SGPRs. Rework
expiry detection and apply wait coalescing to mitigate some of the
additional waits.
Generate s_absdiff_i32. Tested in absdiff.ll. Also update s_cmp_0.ll to
test that s_absdiff_i32 is foldable with a s_cmp_lg_u32 sX, 0.
---------
Signed-off-by: John Lu <John.Lu@amd.com>
When lowering spills / restores, we may end up partially lowering the
spill via copies and the remaining portion with loads/stores. In this
partial lowering case,the implicit-def operands added to the restore
load clobber the preceding copies -- telling MachineCopyPropagation to
delete them. By also attaching an implicit operand to the load, the
COPYs have an artificial use and thus will not be deleted - this is the
same strategy taken in https://github.com/llvm/llvm-project/pull/115285
I'm not sure that we need implicit-def operands on any load restore, but
I guess it may make sense if it needs to be split into multiple loads
and some have been optimized out as containing undef elements.
These implicit / implicit-def operands continue to cause correctness
issues. A previous / ongoing long term plan to remove them is being
addressed via:
https://discourse.llvm.org/t/llvm-codegen-rfc-add-mo-lanemask-type-and-a-new-copy-lanemask-instruction/88021https://github.com/llvm/llvm-project/pull/151123https://github.com/llvm/llvm-project/pull/151124
Cache extracted elements in lowerShuffleVector(). For example, when
lowering
```
%0:_(<2 x s32>) = G_BUILD_VECTOR %0, %1
%2:_(<N x s32>) = G_SHUFFLE_VECTOR %1, shufflemask(0, 0, 0, 0 ... x N )
```
Currently, we generate `N` `G_EXTRACT_VECTOR_ELT` for each element in
shufflemask. This is undesirable and bloats the code, especially for
larger vectors.
With this change, we only generate one `G_EXTRACT_VECTOR_ELT` from `%0`
and reuse it for all four result elements.
If we only deal with a one part of 64bit value we can just generate
merge and unmerge which will be either combined away or
selected into copy / mov_b32.
Add GlobalISel lowering of G_FMINIMUM and G_FMAXIMUM following the same
logic as in SDag's expandFMINIMUM_FMAXIMUM.
Update AMDGPU legalization rules: Pre GFX12 now uses new lowering method
and make G_FMINNUM_IEEE and G_FMAXNUM_IEEE legal to match SDag.
I'm not sure if this is the best way forward or not, but we have a lot
of issues with forgetting that shuffle_vectors can be scalar again and
again. (There is another example from the recent known-bits code added
recently). As a scalar-dst shuffle vector is just an extract, and a
scalar-source shuffle vector is just a build vector, this patch makes
scalar shuffle vector illegal and adjusts the irbuilder to create the
correct node as required.
Most targets do this already through lowering or combines. Making scalar
shuffles illegal simplifies gisel as a whole, it just requires that
transforms that create shuffles of new sizes to account for the scalar
shuffle being illegal (mostly IRBuilder and LessElements).