The default maximum waves/EU returned by the family of
`AMDGPUSubtarget::getWavesPerEU` is currently the maximum number of
waves/EU supported by the subtarget (only a valid occupancy range in
"amdgpu-waves-per-eu" may lower that maximum). This ignores maximum
achievable occupancy imposed by flat workgroup size and LDS usage,
resulting in situations where `AMDGPUSubtarget::getWavesPerEU` produces
a maximum higher than the one from
`AMDGPUSubtarget::getOccupancyWithWorkGroupSizes`.
This limits the waves/EU range's maximum to the maximum achievable
occupancy derived from flat workgroup sizes and LDS usage. This only has
an impact on functions which restrict flat workgroup size with
"amdgpu-flat-work-group-size", since the default range of flat workgroup
sizes achieves the maximum number of waves/EU supported by the
subtarget.
Improvements to the handling of "amdgpu-waves-per-eu" are left for a
follow up PR (e.g., I think the attribute should be able to lower the
full range of waves/EU produced by these methods).
Fixes https://github.com/llvm/llvm-project/issues/130020
This fixes an issue where the si-fold-operands pass would incorrectly
fold immediate values into COPY instructions targeting av_32 registers.
The pass now checks register class constraints before attempting to fold
the immediate.
Since e39f6c1844fab59c638d8059a6cf139adb42279a opt will infer the
correct datalayout when given a triple. Avoid explicitly specifying it
in tests that depend on the AMDGPU target being present to avoid the
string becoming out of sync with the TargetInfo value.
Only tests with REQUIRES: amdgpu-registered-target or a local lit.cfg
were updated to ensure that tests for non-target-specific passes that
happen to use the AMDGPU layout still pass when building with a limited
set of targets.
Reviewed By: shiltian, arsenm
Pull Request: https://github.com/llvm/llvm-project/pull/137921
Fix issue 131298 where an undefined $scc register causes verifier errors
when using SI_KILL_F32_COND_IMM_TERMINATOR instructions. The problem
occurs because the $scc register defined in a comparison before the kill
terminator is used in successor blocks, but was not properly marked as live-in.
This patch:
- Adds code to check if SCC is used in the successor block
- Adds SCC as a live-in to successor blocks
- Handles both explicit and implicit uses of SCC
With this patch the machine verifier no longer reports undefined $scc
errors in following kill terminator instruction.
Fixes#131298
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Currently, if there is already noalias metadata present on loads and
stores, lower module lds pass is generating a more conservative aliasing
set. This results in inhibiting scheduling intrinsics that would have
otherwise generated a better pipelined instruction.
The fix is not to always intersect already existing noalias metadata
with noalias created for lowering of LDS. But to intersect only if
noalias scopes are from the same domain, otherwise concatenate exising
noalias sets with LDS noalias.
There a few patches that have come for scopedAA in the past. Following
three should be enough background information.
https://reviews.llvm.org/D91576https://reviews.llvm.org/D108315https://reviews.llvm.org/D110049
Essentially, after a pass that might change aliasing info, one should
check if that pass results in change number of MayAlias or ModRef using
the following:
`opt -S -aa-pipeline=basic-aa,scoped-noalias-aa -passes=aa-eval
-evaluate-aa-metadata -print-all-alias-modref-info -disable-output`
When floating-point operations are legalized to operations of a higher
precision (e.g. f16 fadd being legalized to f32 fadd) then we get
narrowing then widening operations between each operation. With the
appropriate fast math flags (nnan ninf contract) we can eliminate these
casts.
We currently just need to shift down 32bit wwm registers.
Previous check condition mistakenly select 16bit registers in true16
mode. Update check condition to skip the 16bit register in wmm reg
sorting
This PR updates the `Verifier` to enforce that `alloca` instructions on
AMDGPU must be in AS5. This prevents hitting a misleading backend error
like "unable to select FrameIndex," which makes it look like a backend
bug when it's actually an IR-level issue.
Add support for using the existing SCRATCH_STORE_BLOCK and
SCRATCH_LOAD_BLOCK instructions for saving and restoring callee-saved
VGPRs. This is controlled by a new subtarget feature, block-vgpr-csr. It
does not include WWM registers - those will be saved and restored
individually, just like before. This patch does not change the ABI.
Use of this feature may lead to slightly increased stack usage, because
the memory is not compacted if certain registers don't have to be
transferred (this will happen in practice for calling conventions where
the callee and caller saved registers are interleaved in groups of 8).
However, if the registers at the end of the block of 32 don't have to be
transferred, we don't need to use a whole 128-byte stack slot - we can
trim some space off the end of the range.
In order to implement this feature, we need to rely less on the
target-independent code in the PrologEpilogInserter, so we override
several new methods in SIFrameLowering. We also add new pseudos,
SI_BLOCK_SPILL_V1024_SAVE/RESTORE.
One peculiarity is that both the SI_BLOCK_V1024_RESTORE pseudo and the
SCRATCH_LOAD_BLOCK instructions will have all the registers that are not
transferred added as implicit uses. This is done in order to inform
LiveRegUnits that those registers are not available before the restore
(since we're not really restoring them - so we can't afford to scavenge
them). Unfortunately, this trick doesn't work with the save, so before
the save all the registers in the block will be unavailable (see the
unit test).
This was reverted due to failures in the builds with expensive checks
on, now fixed by always updating LiveIntervals and SlotIndexes in
SILowerSGPRSpills.
This removes the need to explicitly set isTruncStore on truncstorei8 and
other similar PatFrags that include truncstore in their frags DAG.
This allows some new patterns to be imported for AMDGPU as you can see
in the changed test.
The extra isTruncStore were added in ae2b36e8bdfa6, along with some
other tablegen changes to look for MemoryVT along with isTruncStore. I
did not remove the code, because I'm not sure if any out of tree users
have become dependent on it. It's no longer exercised in tree.
This assert was failing in a fuzzing test. I consulted with @jrbyrnes
who said:
The MFMASmallGemmSingleWaveOpt::apply() method is invoked if and only if
the user has inserted an intrinsic llvm.amdgcn.iglp.opt(i32 1) into
their source code. This intrinsic applies a highly specialized DAG
mutation to result in specific scheduling for a specific set of kernels.
These assertions are really just confirming that the characteristics of
the kernel match what is expected (i.e. The kernels are similar to the
ones this DAG mutation strategy were designed against).
However, if we apply this DAG mutation to kernels for which is was not
designed, then we may not find the types of instructions we are looking
for, and may end up with empty caches.
I think it should be fine to just return false if the cache is empty
instead of the assert.
This is a NFC patch.
This patch run a bulk update on CodeGen tests that are impacted by the
true16 features. This patch applies:
1. duplicate GFX11plus runlines and apply them with
"+mattr=+real-true16" and "+mattr=-real-true16"
2. update the test with the update script
For some GISEL runlines, the current CodeGen do not fully support the
true16 version. Still update the runlines, but comment out the failing
one, and added a "FIXME-TRUE16" comment to that test for easier
tracking. These test will be fixed in the following patches.
This is in a transition state that we support both
"+real-true16/-real-true16" in our code base. We plan to move to
"+real-true16" as default, and finally remove "-real-true16" mode and
test lines.
Add support for using the existing `SCRATCH_STORE_BLOCK` and
`SCRATCH_LOAD_BLOCK` instructions for saving and restoring callee-saved
VGPRs. This is controlled by a new subtarget feature, `block-vgpr-csr`.
It does not include WWM registers - those will be saved and restored
individually, just like before. This patch does not change the ABI.
Use of this feature may lead to slightly increased stack usage, because
the memory is not compacted if certain registers don't have to be
transferred (this will happen in practice for calling conventions where
the callee and caller saved registers are interleaved in groups of 8).
However, if the registers at the end of the block of 32 don't have to be
transferred, we don't need to use a whole 128-byte stack slot - we can
trim some space off the end of the range.
In order to implement this feature, we need to rely less on the
target-independent code in the PrologEpilogInserter, so we override
several new methods in `SIFrameLowering`. We also add new pseudos,
`SI_BLOCK_SPILL_V1024_SAVE/RESTORE`.
One peculiarity is that both the SI_BLOCK_V1024_RESTORE pseudo and the
SCRATCH_LOAD_BLOCK instructions will have all the registers that are not
transferred added as implicit uses. This is done in order to inform
LiveRegUnits that those registers are not available before the restore
(since we're not really restoring them - so we can't afford to scavenge
them). Unfortunately, this trick doesn't work with the save, so before
the save all the registers in the block will be unavailable (see the
unit test).
wb/wbinv use storecnt, inv uses loadcnt.
Track them as VMEM_WRITE_ACCESS and VMEM_READ_ACCESS to avoid
InsertWaitCnt incorrectly eliminating the waitcnts after these instructions.
Solves SWDEV-526604
SIInstrInfo::resultDependsOnExec assumes that operand 0 of a comparison
is always the destination of the instruction. This is not true for
instructions in VOPC form where it is "src0". This led to a crash in
machine-cse.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
The PR will fix the issue
https://github.com/llvm/llvm-project/issues/122728
This patch addresses the signed/zero extension of poison by using a
poison value of the extended type instead of a constant zero of the
extended type.
This avoids the need to have special handling at every use site.
Unfortunately this means we unnecessarily emit AssertZext in the DAG
(where we already directly understand the range of the intrinsic), andt
we regress in undefined cases as we don't fold out asserts on undef.
On targets that support v_cvt_pk_f16_f32 instruction, if we make v2f64
-> v2f16 Legal, we will generate the following sequence of instructions:
v_cvt_f32_f64_e32 v1, s[6:7]
v_cvt_f32_f64_e32 v2, s[4:5]
v_cvt_pk_f16_f32 v1, v2, v1
It possibly returns imprecise results due to double rounding. This patch
fixes the issue by not setting the conversion Legal. While we may still
expect the above sequence of code when unsafe fpmath is set, I hope
https://github.com/llvm/llvm-project/pull/134738 can address that
performance concern.
Fixes: SWDEV-523856
Re-apply https://github.com/llvm/llvm-project/pull/130577
Which is reverted in https://github.com/llvm/llvm-project/pull/133880
The old application failed in address sanitizer due to
`tryNarrowMathIfNoOverflow` was called after `I.eraseFromParent();` in
`AMDGPUCodeGenPrepareImpl::visitBinaryOperator`, it create a use after
free failure.
To fix this, `tryNarrowMathIfNoOverflow` will be called before and
directly return if `tryNarrowMathIfNoOverflow` result in true.
Previously the AnnotateKernelFeatures pass infers two attributes:
amdgpu-calls and amdgpu-stack-objects, which are used to help determine
if flat scratch init is allowed. PR #118907 created the
amdgpu-no-flat-scratch-init attribute. Continuing with that work, this
patch makes use of this attribute to determine flat scratch init,
replacing amdgpu-calls and amdgpu-stack-objects. This also leads to the
removal of the AnnotateKernelFeatures pass.