When using the `amdgcn.init.whole.wave` intrinsic, we add dummy VGPR
arguments with the purpose of preserving their inactive lanes. The
pattern may look something like this:
```
entry:
call amdgcn.init.whole.wave
branch to shader or tail
shader:
$vInactive = IMPLICIT_DEF ; Tells regalloc it's safe to use the active lanes
actual code...
tail:
call amdgcn.cs.chain [...], implicit $vInactive
```
We should not report these VGPRs in the `.vgpr_count` metadata. This
patch achieves that goal by ignoring meta instructions and calls. This should
be safe since if those registers are actually used in any other context,
they will be counted there. The same reasoning applies in the general
case, so we don't explicitly check for the existence of `init.whole.wave`.
This is a reworked version of #133242, which was reverted in #144039
and split into smaller bits.
Cause:
1. `implicit_def` inside bundle does not count for define of reg in
machineinst verifier
2. Including `implicit_def` will cause relative reg not define, result
in `Bad machine code: Using an undefined physical register` in the
machineinst verifier
Fixes https://github.com/llvm/llvm-project/issues/139102
---------
Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
Constructing Target triple with `ObjectFile::makeTriple` instead of just
with `Arch` and leaving the rest unknown. Also creating the subtarget
with the `CPU`. AMDGPU needs the full triple and `CPU` to disassemble
correctly.
To run a full test, also fixed a failure in `SIPreAllocateWWMRegs` with
the `$noreg` operand in `DBG_VALUE`.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
In review of bbde6b, I had originally proposed that we support the
legacy text format. As review evolved, it bacame clear this had been a
bad idea (too much complexity), but in order to let that patch finally
move forward, I approved the change with the variant. This change undoes
the variant, and updates all the tests to just use the array form.
Not quite NFC as it looks like the original intrinsic-handling code
never got updated to use records. This was never caught because that
code wasn't tested. I've adjusted an existing test so the behaviour is
now covered.
This PR depends on #148165; the first commit
(90f1d0a881a21a8b4f192622d798c290770fda63) belongs to that PR. The
changes are distinct, so separate PRs seemed like the best option. I
don't have commit access, so I couldn't use user-branches to mark the
dependency.
As AMDGPULateCodeGenPrepare actually performs changes that invalidate
Uniformity Analysis; use `setPreservesCFG()` to mark this, instead of
`setPreservesAll()` which wrongly includes preserving Uniformity
Analysis.
Note that before #148165, this would still have preserved Uniformity
Analysis, hence the dependency. In addition, `amdgpu/llc-pipeline.cc`
needs to be changed when both changes are in effect, but those changes
would make the test fail if the PRs weren't based on one another.
Note on why this hasn't caused issues so far:
It just so happens that AMDGPULateCodeGenPrepare is always immediately
followed by AMDGPUUnifyDivergentExitNodes, which *does* invalidate most
analyses, including Uniformity. And because UnifyDivergentExitNodes only
looks at terminators, and LateCGP seemingly does not replace uniform
values with divergent values, or divergent values with uniform values,
and it only *inserts new values that are not looked at by
UnifyDivergentExitNodes*, this bug remained hidden.
---
I ran `git-clang-format` on my changes. I tested them using the
`check-llvm` target; no unexpected failures occurred after I made the
change to `amdgpu/llc-pipeline.ll`.
For flat memory instructions where the address is supplied as a base address
register with an immediate offset, the memory aperture test ignores the
immediate offset. Currently, ISel does not respect that, which leads to
miscompilations where valid input programs crash when the address computation
relies on the immediate offset to get the base address in the proper memory
aperture. Global or scratch instructions are not affected.
This patch only selects flat instructions with immediate offsets from address
computations with the inbounds flag: If the address computation does not leave
the bounds of the allocated object, it cannot leave the bounds of the memory
aperture and is therefore safe to handle with an immediate offset.
Relevant tests are in fold-gep-offset.ll.
Analogous to #132353 for SDAG (which is not yet in a mergeable state, its
progress is currently blocked by #146076).
Fixes SWDEV-516125 for GISel.
This is in preparation for a patch that disables folding offsets into FLAT
instructions if the corresponding address computation is not inbounds, to avoid
miscompilations where this would lead to wrong aperture check results.
With the added inbounds flags for GEPs and G_PTR_ADDs affecting FLAT
instructions, the outputs for these tests won't change.
For SWDEV-516125.
Add a new constraint corresponding to the AV_* register classes
for operands which can allocate AGPRs or VGPRs. This applies
to load and stores on gfx90a+, and srcA / srcB for MFMA instructions.
The error emitted on unsupported targets isn't ideal, it is
produced by the register allocator without a rationale, but it is
consistent with the existing errors.
I mostly want this for writing allocation tests.
The `RPTarget`'s way of determining whether VGPRs are beneficial to save
and whether the target has been reached w.r.t. VGPR usage currently
assumes, if `CombinedVGPRSavings` is true, that free slots in one VGPR
RC can always be used for the other. Implicitly, this makes the
rematerialization stage (only current user of `RPTarget`) follow a
different occupancy calculation than the "regular one" that the
scheduler uses, one that assumes that ArchVGPR/AGPR usage can be
balanced perfectly and at no cost, which is untrue in general. This
ultimately yields suboptimal rematerialization decisions that require
cross-VGPR-RC copies unnecessarily.
This fixes that, making the `RPTarget`'s internal model of occupancy
consistent with the regular one. The `CombinedVGPRSavings` flag is
removed, and a form of cross-VGPR-RC saving implemented only for unified
RFs, which is where it makes the most sense. Only when the amount of
free VGPRs in a given VGPR RC (ArchVPGR or AGPR) is lower than the
excess VGPR usage in the other VGPR RC does the `RPTarget` consider that
a pressure reduction in the former will be beneficial to the latter.
When computing the number of registers required by entry functions, the
`AMDGPUAsmPrinter` needs to take into account both the register usage
computed by the `AMDGPUResourceUsageAnalysis` pass, and the number
of registers initialized by the hardware. At the moment, the way it
computes the latter is different for graphics vs compute, due to differences in
the implementation. For kernels, all the information needed is available in
the `SIMachineFunctionInfo`, but for graphics shaders we would iterate over
the `Function` arguments in the `AMDGPUAsmPrinter`. This pretty much
repeats some of the logic from instruction selection.
This patch introduces 2 new members to `SIMachineFunctionInfo`, one
for SGPRs and one for VGPRs. Both will be computed during instruction
selection and then used during `AMDGPUAsmPrinter`, removing the need
to refer to the `Function` when printing assembly.
This patch is NFC except for the fact that we now add the extra SGPRs
(VCC, XNACK etc) to the number of SGPRs computed for graphics entry points.
I'm not sure why these weren't included before. It would be nice if
someone could confirm if that was just an oversight or if we have some docs
somewhere that I haven't managed to find. Only one test is affected (its SGPR
usage increases because we now take into account the XNACK registers).
Now that #149310 has restricted lifetime intrinsics to only work on
allocas, we can also drop the explicit size argument. Instead, the size
is implied by the alloca.
This removes the ability to only mark a prefix of an alloca alive/dead.
We never used that capability, so we should remove the need to handle
that possibility everywhere (though many key places, including stack
coloring, did not actually respect this).
Sec. 4.6.7.1 of the gfx1250 SPG states that if an SGPR is used
as an operand, only one SGPR will be read for both the low and high
operations. As a result, the corresponding bits in `op_sel` and
`op_sel_hi` must be the same when the operand is an SGPR.
Co-authored-by: Tian, Shilei <Shilei.Tian@amd.com>
Co-authored-by: Tian, Shilei <Shilei.Tian@amd.com>
Change from GFX12: Relax S_CLAUSE rules to all all non-flat memory types
in
the same clause, and all Flat types in the same.
For VMEM/FLAT clause types now look like:
- Non-Flat (load, store, atomic): buffer, global, scratch, TDM, Async
- Flat: load, store, atomic
Add to the VOP patterns to recognise when or/xor/and are masking only the most significant bit of i32/v2i32/i64 and replace with the corresponding FP source modifier.
This makes the test actually test the intended rewrite
pass. Also add some tests with inline immediates in src2.
Switch the target to gfx942 for future test functions.
Test other splitting situations that appear in greedy.
This includes ensuring we have a case that hits a local split
and instruction split (most of the tests hit the region split path).
Also test a few cases where the final result isn't fully used, resulting
in partial copy bundles instead of a simple full copy. Test physreg
and virtreg agpr interference with a reassignment candidate.
I'm accumulating too many failure cases, and MIR tests are very prone
to painful merge conflicts, so I've added a few more tests and extracted
new tests from #147975.
Closes#149026
Do not fold an immediate into an instruction that already has a frame
index operand. A frame index could possibly turn out to be another immediate.
Fixes: SWDEV-536263
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Add the llvm.amdgcn.call.whole.wave intrinsic for calling whole wave
functions. This will take as its first argument the callee with the
amdgpu_gfx_whole_wave calling convention, followed by the call
parameters which must match the signature of the callee except for the
first function argument (the i1 original EXEC mask, which doesn't need
to be passed in). Indirect calls are not allowed.
Make direct calls to amdgpu_gfx_whole_wave functions a verifier error.
Unspeakable horrors happen around calls from whole wave functions, the
plan is to improve the handling of caller/callee-saved registers in
a future patch.
Tail calls are also handled in a future patch.
Now that #146490 removed the assertion in visitFreeze to assert that the
node was still isGuaranteedNotToBeUndefOrPoison we no longer need this
reduced depth hack (which had to account for the difference in depth of
freeze(op()) vs op(freeze())
Helps with some of the minor regressions in #150017
Similar to InstCombinerImpl::freezeOtherUses, attempt to ensure that we
merge multiple frozen/unfrozen uses of a SDValue. This fixes a number of
hasOneUse() problems when trying to push FREEZE nodes through the DAG.
Remove SimplifyMultipleUseDemandedBits handling of FREEZE nodes as we
now want to keep the common node, and not bypass for some nodes just
because of DemandedElts.
Fixes#149799
Fixes#139791.
When trying to combine two G_UNMERGE_VALUES with a COPY between them,
a crash occurs in tryCombineUnmergeValues() because getDefIndex() tries
to find the index of the original source register in the def found by
getDefIgnoringCopies(). In the case of a COPY in between, this register
is not present in the def, only in the COPY.