The mov64 pseudo is split into two 32-bit movs, but those 32-bit movs
still had the full 64-bit register implicitly defined. VOPD formation is
affected by this, so more VOPD instructions can now be emitted.
Add ThreeOp_v2i32_Pats pattern class to support v2i32 vector operations
for AND_OR_B32 and OR3_B32 instructions. The new patterns match the
v2i32 and-or or or-or instruction sequence, extract the individual 32-bit
elements from the v2i32 operands, and apply the and_or or or3 VOP3
operations.
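The equivalence the patterns rely on can be sketched in plain C++. This is only a model of the arithmetic, not the TableGen patterns themselves: `V2I32` and all function names below are illustrative assumptions, and the scalar helpers assume the usual semantics `(a & b) | c` and `a | b | c`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Illustrative stand-in for a v2i32 value; not an LLVM type.
using V2I32 = std::array<std::uint32_t, 2>;

// Assumed scalar semantics of the two three-operand operations.
inline std::uint32_t andOrB32(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
  return (a & b) | c;
}
inline std::uint32_t or3B32(std::uint32_t a, std::uint32_t b, std::uint32_t c) {
  return a | b | c;
}

// The matched sequences: a vector AND (or OR) followed by a vector OR.
inline V2I32 vectorAndThenOr(const V2I32 &a, const V2I32 &b, const V2I32 &c) {
  const V2I32 t{a[0] & b[0], a[1] & b[1]};
  return V2I32{t[0] | c[0], t[1] | c[1]};
}
inline V2I32 vectorOrThenOr(const V2I32 &a, const V2I32 &b, const V2I32 &c) {
  const V2I32 t{a[0] | b[0], a[1] | b[1]};
  return V2I32{t[0] | c[0], t[1] | c[1]};
}

// The replacement: extract each 32-bit element and apply one three-operand op.
inline V2I32 perElementAndOr(const V2I32 &a, const V2I32 &b, const V2I32 &c) {
  return V2I32{andOrB32(a[0], b[0], c[0]), andOrB32(a[1], b[1], c[1])};
}
inline V2I32 perElementOr3(const V2I32 &a, const V2I32 &b, const V2I32 &c) {
  return V2I32{or3B32(a[0], b[0], c[0]), or3B32(a[1], b[1], c[1])};
}

// Small self-check on sample values.
inline void checkThreeOpEquivalence() {
  const V2I32 a{0xF0F0F0F0u, 0x12345678u};
  const V2I32 b{0x0FF00FF0u, 0xFFFF0000u};
  const V2I32 c{0x00000001u, 0x0000ABCDu};
  assert(vectorAndThenOr(a, b, c) == perElementAndOr(a, b, c));
  assert(vectorOrThenOr(a, b, c) == perElementOr3(a, b, c));
}
```

Each two-instruction vector sequence collapses to one three-operand instruction per element, which is the saving the patterns are after.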
This issue was discovered during some downstream work around Vulkan CTS
tests, specifically
`dEQP-VK.subgroups.arithmetic.compute.subgroupadd_float`.
---------
Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
Propagate demanded bits through the readfirstlane intrinsic by
implementing SimplifyDemandedBitsForTargetNode in
AMDGPUISelLowering.
This allows upstream zero/sign extensions to be eliminated when only a
subset of bits is used after the intrinsic.
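A small model shows why the extension becomes dead. Everything here is a hypothetical stand-in (the lane vector, the `exec` mask, the helper names), and falling back to lane 0 when no lane is active is a modelling choice for this sketch. If the consumer reads only the low 16 bits, a sign extension from 16 bits applied before readfirstlane cannot change the result.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of readfirstlane: return the value held by the first
// active lane. If no lane is active, lane 0 is used (modelling choice).
inline std::uint32_t readFirstLane(const std::vector<std::uint32_t> &lanes,
                                   const std::vector<bool> &exec) {
  assert(!lanes.empty() && lanes.size() == exec.size());
  for (std::size_t i = 0; i < lanes.size(); ++i)
    if (exec[i])
      return lanes[i];
  return lanes[0];
}

// Sign-extend the low 16 bits of x to 32 bits.
inline std::uint32_t signExtend16(std::uint32_t x) {
  const std::uint32_t low = x & 0xFFFFu;
  return (low ^ 0x8000u) - 0x8000u;
}

// Consumer uses only the low 16 bits; the upstream extension is kept.
inline std::uint32_t low16WithExtension(const std::vector<std::uint32_t> &lanes,
                                        const std::vector<bool> &exec) {
  std::vector<std::uint32_t> extended;
  extended.reserve(lanes.size());
  for (std::uint32_t v : lanes)
    extended.push_back(signExtend16(v));
  return readFirstLane(extended, exec) & 0xFFFFu;
}

// Same consumer with the extension removed.
inline std::uint32_t low16WithoutExtension(const std::vector<std::uint32_t> &lanes,
                                           const std::vector<bool> &exec) {
  return readFirstLane(lanes, exec) & 0xFFFFu;
}
```

Because readfirstlane only moves bits without changing them, the demanded mask on its result can be passed unchanged to its operand.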
Partially addresses #128390.
addAliasScopeMetadata in AMDGPULowerKernelArguments skips instructions
with empty PtrArgs, including memory-accessing calls that have no
pointer arguments (e.g. builtins like threadIdx()). Because these calls
never receive !noalias metadata, ScopedNoAliasAA cannot prove they don't
alias noalias kernel arguments. MemorySSA then conservatively reports
them as clobbers, which prevents AMDGPUAnnotateUniformValues from
marking loads as noclobber, blocking scalarization (s_load) and forcing
expensive vector loads (global_load) instead.
Fix by adding all noalias kernel argument scopes to !noalias metadata
for memory-accessing instructions with no pointer arguments. Since such
instructions cannot access memory through any kernel pointer argument,
all noalias scopes are safe to apply.
This fixes a performance regression in rocFFT introduced by bd9668df0f00
("[AMDGPU] Propagate alias information in AMDGPULowerKernelArguments").
Assisted-by: Claude Opus
Moving these into the middle-end pipeline will allow for additional
optimization of the expansion result, such as CSE of redundant loads
(cf. https://godbolt.org/z/bEna4Md9r). For now, we conservatively place
the passes at the end of the middle-end pipeline, so we mostly don't
benefit from additional optimizations yet. The pipeline position will be
moved in a future change.
This builds on work done by legrosbuffle in
https://reviews.llvm.org/D60318.
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Do not remove IMPLICIT_DEF of a physreg unless all uses have an undef
flag added. Previously, only the first use instruction had undef flags
added, which caused a failure in machine instruction verification.
Multi-instruction uses are tested in AMDGPU/multi-use-implicit-def.mir and
X86/multi-use-implicit-def.mir.
---------
Signed-off-by: John Lu <John.Lu@amd.com>
When spill slots are eliminated (VGPR-to-AGPR, SGPR-to-VGPR lanes),
debug values referencing these frame indices were not always properly
cleaned up. This caused an assertion failure in getObjectOffset() when
PrologEpilogInserter tried to access the offset of a dead frame object.
The existing debug fixup code in SIFrameLowering and SILowerSGPRSpills
had two limitations:
1. It only checked one operand position, but DBG_VALUE_LIST instructions
can have multiple debug operands with frame indices.
2. It didn't handle all types of dead frame indices uniformly.
Fix by centralizing debug info cleanup in removeDeadFrameIndices(),
which already knows all frame indices being removed. This iterates over
all debug operands using MI.debug_operands().
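The difference between checking one operand position and visiting every debug operand can be sketched with a toy model. The structs and function names below are hypothetical stand-ins, not LLVM classes. Setting `isFrameIndex` to false models resetting the operand so it no longer names a stack slot, and treating the first operand as the single checked position is an assumption (the text above only says one position was checked).

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-ins for illustration; these are not LLVM classes.
struct DebugOperand {
  bool isFrameIndex; // true if the operand refers to a stack slot
  int frameIndex;    // meaningful only when isFrameIndex is true
};
struct DebugInstr {
  std::vector<DebugOperand> debugOperands; // a DBG_VALUE_LIST may have several
};

// Limited approach: look at a single operand position only.
inline void clearOnePosition(DebugInstr &mi, const std::set<int> &deadFIs) {
  assert(!mi.debugOperands.empty());
  DebugOperand &op = mi.debugOperands.front();
  if (op.isFrameIndex && deadFIs.count(op.frameIndex))
    op.isFrameIndex = false;
}

// Centralized approach: visit every debug operand of the instruction.
inline void clearAllPositions(DebugInstr &mi, const std::set<int> &deadFIs) {
  for (DebugOperand &op : mi.debugOperands)
    if (op.isFrameIndex && deadFIs.count(op.frameIndex))
      op.isFrameIndex = false;
}

// True if any operand still names a frame index that has been removed.
inline bool referencesDeadFrameIndex(const DebugInstr &mi,
                                     const std::set<int> &deadFIs) {
  for (const DebugOperand &op : mi.debugOperands)
    if (op.isFrameIndex && deadFIs.count(op.frameIndex))
      return true;
  return false;
}
```

In this model, a leftover reference after `clearOnePosition` corresponds to the stale frame index that later triggers the getObjectOffset() assertion.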
Assisted-by: Claude Code.
In SIFoldOperands, folding `or x, -1` to `v_mov_b32 -1` removed
`Src1Idx`, which is incorrect because `-1` is in `Src0Idx` (after
canonicalization).
Closes https://github.com/llvm/llvm-project/issues/189677.
Disable generic DAG combines for AMDGPU at -O0 via
disableGenericCombines() to preserve instructions that users may want to
set breakpoints on during debugging.
Assisted-by: Cursor / Claude Opus 4.6
[This
case](f380a878d5/llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp (L343-L345))
is not covered by any existing tests (checked using code coverage and by
inserting an `abort` at that line). I propose a new test that tests this
line.
This is demonstrated by showing that it is the only test that fails in
the presence of the `abort`.
This is a second attempt at "[SelectionDAG] Expand
CTTZ_ELTS[_ZERO_POISON] and handle splitting" (#188220).
That PR had to be reverted in 7d39664a6ae8daaf186b65578492244d96a50bf2
because we had crashes on AMDGPU since we didn't have scalarization
support, and other crashes on PowerPC because we didn't handle the case
where a vector needed to be widened. Tests for these are added in
AMDGPU/cttz-elts.ll, RISCV/rvv/cttz-elts-scalarize.ll and
PowerPC/cttz-elts.ll.
The first crash has been fixed by adding
DAGTypeLegalizer::ScalarizeVecOp_CTTZ_ELTS.
The second crash has been fixed by reworking
TargetLowering::expandCttzElts. The expansion for CTTZ_ELTS is nearly
identical to VECTOR_FIND_LAST_ACTIVE, except it uses a reverse step
vector and subtracts the result from VF. The easiest way to fix these
crashes without introducing regressions is to reuse the
VECTOR_FIND_LAST_ACTIVE expansion, which already handles the case where
the vector needs to be widened.
This means that the node now needs to take in a boolean vector argument
and uses VSELECT instead of an AND to zero out inactive lanes, so the op
promotion code has also been shared.
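The arithmetic behind the expansion can be sketched on a plain mask. This is a scalar model, not the SelectionDAG code: the function names are illustrative, and the reverse step vector is assumed to be [VF, VF-1, ..., 1].

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Reference semantics: index of the first active lane, or VF if none.
inline std::uint64_t cttzEltsReference(const std::vector<bool> &mask) {
  for (std::uint64_t i = 0; i < mask.size(); ++i)
    if (mask[i])
      return i;
  return mask.size();
}

// Expansion sketch: build the reverse step vector, select 0 for inactive
// lanes (the VSELECT), take the unsigned max, and subtract it from VF.
inline std::uint64_t cttzEltsExpanded(const std::vector<bool> &mask) {
  const std::uint64_t vf = mask.size();
  std::uint64_t maxSelected = 0;
  for (std::uint64_t i = 0; i < vf; ++i) {
    const std::uint64_t reverseStep = vf - i; // lane i holds VF - i
    const std::uint64_t selected = mask[i] ? reverseStep : 0;
    maxSelected = std::max(maxSelected, selected);
  }
  assert(maxSelected <= vf); // the subtraction below cannot underflow
  return vf - maxSelected;
}
```

The earliest active lane carries the largest reverse step value, so the maximum identifies it, and an all-false mask yields VF.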
A fract implementation can equivalently be written as
r = fmin(x - floor(x));
r = isnan(x) ? x : r;
r = isinf(x) ? 0.0 : r;
or:
r = fmin(x - floor(x));
r = isinf(x) ? 0.0 : r;
r = isnan(x) ? x : r;
Previously this only matched the first form. Also match
the case where the isinf check is the inner clamp. There are
a few more ways to write this pattern (e.g., moving the clamp of
infinity to the input), but I haven't encountered those in the wild.
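The two orderings can be checked against each other with a short sketch. The second argument of fmin is elided in the description above. This sketch assumes it is the largest float below 1.0, which is what OpenCL's fract is specified to clamp to. The function names are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Assumption: clamp to the largest float below 1.0 (the second fmin argument
// is not given in the description above).
inline float clampedFraction(float x) {
  const float belowOne = std::nextafter(1.0f, 0.0f);
  assert(belowOne < 1.0f);
  return std::fmin(x - std::floor(x), belowOne);
}

// First form: NaN select inside, infinity select outside.
inline float fractInfOuter(float x) {
  float r = clampedFraction(x);
  r = std::isnan(x) ? x : r;
  r = std::isinf(x) ? 0.0f : r;
  return r;
}

// Second form: infinity select inside, NaN select outside.
inline float fractInfInner(float x) {
  float r = clampedFraction(x);
  r = std::isinf(x) ? 0.0f : r;
  r = std::isnan(x) ? x : r;
  return r;
}

// True if both results are NaN or compare equal.
inline bool sameResult(float a, float b) {
  return (std::isnan(a) && std::isnan(b)) || a == b;
}
```

The order of the two selects does not matter because an input cannot be both NaN and infinite, so at most one of them ever replaces the clamped value.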
The existing code seems to be trying too hard to match noncanonical
variants of the pattern. Only handle the result that instcombine
produces for all 4 permutations of compare and select.
Remove ConstantFPSDNode handling from isKnownNeverNaN and fall back to
using computeKnownFPClass if there are no opcode matches in
isKnownNeverNaN.
The test check changes are due to isKnownNeverNaN not handling
UNDEF/POISON, whereas computeKnownFPClass does (POISON in particular now
returns isKnownNeverNaN == true, preventing an ISD::FCANONICALIZE call in
expandFMINNUM_FMAXNUM).
Reland https://github.com/llvm/llvm-project/pull/184929 after fixing
some issues in the NDEBUG builds.
3a640ee is unchanged from the previously approved PR; the unreviewed
portion of this PR is 9cabd8d.
The test appeared to pass, but it was not actually checking the intended
behavior. A WeakVH simply becomes null once the value it points to is
replaced, so a zero value was used because VarIndex became null, and the
test checks still looked correct. Only a WeakTrackingVH is updated to
follow the new value.
Change the test slightly so that using a zero index gives a wrong result.
Adds basic support for new heuristics for the CoExecSchedStrategy.
InstructionFlavor provides a way to map instructions to different
"Flavors". These "Flavors" all have special scheduling considerations --
either they map to different HardwareUnits, or have unique scheduling
properties like fences.
HardwareUnitInfo provides a way to track and analyze the usage of some
hardware resource across the current scheduling region.
CandidateHeuristics holds the state for new heuristics, as well as the
implementations.
In addition, this adds new heuristics to use the various support pieces
listed above. tryCriticalResource attempts to schedule instructions that
use the most demanded HardwareUnit. If no such instructions are ready to
be scheduled, tryCriticalResourceDependency attempts to schedule
instructions which enable instructions that use demanded HardwareUnits.
We are adding the new heuristics incrementally. While this is in
progress, the state of tryCandidateCoexec may not be great, as is the case
after this PR.