Moving these into the middle-end pipeline will allow for additional
optimization of the expansion result, such as CSE of redundant loads
(cf. https://godbolt.org/z/bEna4Md9r). For now, we conservatively place
the passes at the end of the middle-end pipeline, so we mostly don't
benefit from additional optimizations yet. The pipeline position will be
moved in a future change.
This builds on work done by legrosbuffle in
https://reviews.llvm.org/D60318.
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Do not remove IMPLICIT_DEF of a physreg unless all uses have an undef
flag added. Previously, only the first use instruction had undef flags
added, which caused a failure in machine instruction verification.
Multi-instruction uses are tested in AMDGPU/multi-use-implicit-def.mir
and X86/multi-use-implicit-def.mir.
---------
Signed-off-by: John Lu <John.Lu@amd.com>
When spill slots are eliminated (VGPR-to-AGPR, SGPR-to-VGPR lanes),
debug values referencing these frame indices were not always properly
cleaned up. This caused an assertion failure in getObjectOffset() when
PrologEpilogInserter tried to access the offset of a dead frame object.
The existing debug fixup code in SIFrameLowering and SILowerSGPRSpills
had two limitations:
1. It only checked one operand position, but DBG_VALUE_LIST instructions
can have multiple debug operands with frame indices.
2. It didn't handle all types of dead frame indices uniformly.
Fix by centralizing debug info cleanup in removeDeadFrameIndices(),
which already knows all frame indices being removed. This iterates over
all debug operands using MI.debug_operands().
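To illustrate the first limitation, here is a minimal, self-contained sketch of visiting every debug operand rather than a single fixed position. It is a toy model, not the AMDGPU code: `DebugOperand`, `DebugInstr`, and `dropDeadFrameIndexRefs` are hypothetical names, and marking an operand as undef merely stands in for whatever the real cleanup does.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for one debug operand of a DBG_VALUE or
// DBG_VALUE_LIST instruction.
struct DebugOperand {
  bool IsFrameIndex; // true if the operand refers to a stack object
  int FrameIndex;    // meaningful only when IsFrameIndex is true
  bool IsUndef;      // set once the location has been dropped
};

// Hypothetical stand-in for a debug instruction with any number of
// debug operands.
struct DebugInstr {
  std::vector<DebugOperand> DebugOperands;
};

// Visit every debug operand of every instruction and drop the ones that
// refer to a dead frame index. Returns the number of operands dropped.
std::size_t dropDeadFrameIndexRefs(std::vector<DebugInstr> &Instrs,
                                   const std::set<int> &DeadFrameIndices) {
  std::size_t NumDropped = 0;
  for (DebugInstr &MI : Instrs) {
    for (DebugOperand &Op : MI.DebugOperands) {
      if (!Op.IsFrameIndex || DeadFrameIndices.count(Op.FrameIndex) == 0)
        continue;
      // Assumption: the location is simply marked as unknown.
      Op.IsFrameIndex = false;
      Op.FrameIndex = -1;
      Op.IsUndef = true;
      ++NumDropped;
    }
  }
  return NumDropped;
}
```

In this model, a check limited to the first operand would miss a dead frame index stored in the second operand of a multi-operand instruction.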
Assisted-by: Claude Code.
In SIFoldOperands, folding `or x, -1` to `v_mov_b32 -1` removed
`Src1Idx`, which is incorrect because `-1` is in `Src0Idx` (after
canonicalization).
Closes https://github.com/llvm/llvm-project/issues/189677.
Disable generic DAG combines for AMDGPU at -O0 via
disableGenericCombines() to preserve instructions that users may want to
set breakpoints on during debugging.
Assisted-by: Cursor / Claude Opus 4.6
[This
case](f380a878d5/llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp (L343-L345))
is not covered by any existing tests (checked using code coverage and by
inserting an `abort` at that line). Add a new test that exercises this
line; it is the only test that fails in the presence of the `abort`.
This is a second attempt at "[SelectionDAG] Expand
CTTZ_ELTS[_ZERO_POISON] and handle splitting" (#188220).
That PR had to be reverted in 7d39664a6ae8daaf186b65578492244d96a50bf2
because we had crashes on AMDGPU since we didn't have scalarization
support, and other crashes on PowerPC because we didn't handle the case
when a vector needed to be widened. Tests for these are added in
AMDGPU/cttz-elts.ll, RISCV/rvv/cttz-elts-scalarize.ll and
PowerPC/cttz-elts.ll.
The first crash has been fixed by adding
DAGTypeLegalizer::ScalarizeVecOp_CTTZ_ELTS.
The second crash has been fixed by reworking
TargetLowering::expandCttzElts. The expansion for CTTZ_ELTS is nearly
identical to VECTOR_FIND_LAST_ACTIVE, except it uses a reverse step
vector and subtracts the result from VF. The easiest way to fix these
crashes without introducing regressions is to reuse the
VECTOR_FIND_LAST_ACTIVE expansion, which already handles the case where
the vector needs to be widened.
This means that the node now needs to take in a boolean vector argument
and uses VSELECT instead of an AND to zero out inactive lanes, so the op
promotion code has also been shared.
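As a rough scalar model of the relationship described above (not the actual SelectionDAG code), the sketch below computes a find-last-active style result with a step vector, and the CTTZ_ELTS result with a reverse step vector whose maximum is subtracted from VF. The function names and the use of `std::vector<bool>` for the mask are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Model of the find-last-active expansion: select the lane index for
// active lanes and 0 for inactive lanes, then reduce with unsigned max.
// In this model an all-false mask yields 0.
std::size_t findLastActiveModel(const std::vector<bool> &Mask) {
  std::size_t Max = 0;
  for (std::size_t I = 0; I < Mask.size(); ++I) {
    std::size_t Selected = Mask[I] ? I : 0; // plays the role of VSELECT
    Max = std::max(Max, Selected);
  }
  return Max;
}

// Model of the CTTZ_ELTS expansion: same shape, but with a reverse step
// vector (VF, VF-1, ..., 1) and the reduction result subtracted from VF.
std::size_t cttzEltsModel(const std::vector<bool> &Mask) {
  const std::size_t VF = Mask.size();
  std::size_t Max = 0;
  for (std::size_t I = 0; I < VF; ++I) {
    std::size_t ReverseStep = VF - I;
    std::size_t Selected = Mask[I] ? ReverseStep : 0; // zero inactive lanes
    Max = std::max(Max, Selected);
  }
  // The first active lane holds the largest reverse step, so VF - Max is
  // its index. With no active lanes, Max is 0 and the result is VF.
  return VF - Max;
}
```

For the mask `{0, 0, 1, 1}` the selected reverse steps are `{0, 0, 2, 1}`, the maximum is 2, and the model returns `4 - 2 = 2`, the number of leading inactive elements.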
A fract implementation can equivalently be written as
r = fmin(x - floor(x));
r = isnan(x) ? x : r;
r = isinf(x) ? 0.0 : r;
or:
r = fmin(x - floor(x));
r = isinf(x) ? 0.0 : r;
r = isnan(x) ? x : r;
Previously this only matched the first form. Also match the case where
the isinf check is the inner clamp. There are a few more ways to write
this pattern (e.g., moving the clamp of infinity to the input), but I
haven't encountered them in the wild.
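The equivalence of the two orderings can be checked with a small standalone sketch. The pseudocode above leaves out the second `fmin` argument; the sketch assumes it is the largest float below 1.0 (`kOneMinusUlp` is a name invented here), and the function names are hypothetical.

```cpp
#include <cmath>

// Assumption: the clamp value is the largest float strictly below 1.0,
// which keeps the result in [0, 1).
static const float kOneMinusUlp = std::nextafter(1.0f, 0.0f);

// First form: NaN check is the inner select, infinity check is the outer.
float fractInfOuter(float X) {
  float R = std::fmin(X - std::floor(X), kOneMinusUlp);
  R = std::isnan(X) ? X : R;
  R = std::isinf(X) ? 0.0f : R;
  return R;
}

// Second form: infinity check is the inner select, NaN check is the outer.
float fractNanOuter(float X) {
  float R = std::fmin(X - std::floor(X), kOneMinusUlp);
  R = std::isinf(X) ? 0.0f : R;
  R = std::isnan(X) ? X : R;
  return R;
}
```

The two selects test mutually exclusive conditions on `x` (a value cannot be both NaN and infinite), so swapping their order does not change the result.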
The existing code seems to be trying too hard to match noncanonical
variants of the pattern. Only handle the result that all 4 permutations
of compare and select produce out of instcombine.
Remove ConstantFPSDNode handling from isKnownNeverNaN and fall back to
using computeKnownFPClass if there are no opcode matches in
isKnownNeverNaN.
The test check changes are due to isKnownNeverNaN not handling
UNDEF/POISON, whereas computeKnownFPClass does (POISON in particular now
returns isKnownNeverNaN == true, preventing an ISD::FCANONICALIZE call in
expandFMINNUM_FMAXNUM).
Reland https://github.com/llvm/llvm-project/pull/184929 after fixing
some issues in the NDEBUG builds.
3a640ee is unchanged from the previously approved PR; the unreviewed
portion of this PR is 9cabd8d.
The test appeared to pass, but it was not checking the intended
behavior. A WeakVH simply becomes null after the value it points to is
replaced, so a zero value was used because VarIndex became null. Only a
WeakTrackingVH can be updated to the new value.
Change the test slightly so that using a zero index is wrong.
Adds basic support for new heuristics for the CoExecSchedStrategy.
InstructionFlavor provides a way to map instructions to different
"Flavors". These "Flavors" all have special scheduling considerations --
either they map to different HardwareUnits, or have unique scheduling
properties like fences.
HardwareUnitInfo provides a way to track and analyze the usage of some
hardware resource across the current scheduling region.
CandidateHeuristics holds the state for new heuristics, as well as the
implementations.
In addition, this adds new heuristics to use the various support pieces
listed above. tryCriticalResource attempts to schedule instructions that
use the most demanded HardwareUnit. If no such instructions are ready to
be scheduled, tryCriticalResourceDependency attempts to schedule
instructions which enable instructions that use demanded HardwareUnits.
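The two heuristics can be pictured with a toy model. This is only a sketch of the idea described above, not the CoExecSchedStrategy implementation: `SchedInstr`, `mostDemandedUnit`, `pickCandidate`, the unit names, and the use of "number of unscheduled instructions" as the demand metric are all assumptions for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical model of an instruction in the scheduling region.
struct SchedInstr {
  std::string Unit;                    // hardware unit it occupies
  bool Ready;                          // can be scheduled right now
  std::vector<std::size_t> Successors; // indices of dependent instructions
  bool Scheduled;                      // already placed in the schedule
};

const std::size_t NoCandidate = static_cast<std::size_t>(-1);

// Assumption: demand is the count of unscheduled instructions per unit.
// Ties resolve to the lexicographically smallest unit name.
std::string mostDemandedUnit(const std::vector<SchedInstr> &Region) {
  std::map<std::string, unsigned> Demand;
  for (const SchedInstr &I : Region)
    if (!I.Scheduled)
      ++Demand[I.Unit];
  std::string Best;
  unsigned BestCount = 0;
  for (const auto &Entry : Demand) {
    if (Entry.second > BestCount) {
      Best = Entry.first;
      BestCount = Entry.second;
    }
  }
  return Best;
}

std::size_t pickCandidate(const std::vector<SchedInstr> &Region) {
  const std::string Critical = mostDemandedUnit(Region);
  // First preference: a ready instruction that uses the critical unit.
  for (std::size_t I = 0; I < Region.size(); ++I)
    if (!Region[I].Scheduled && Region[I].Ready && Region[I].Unit == Critical)
      return I;
  // Fallback: a ready instruction that enables a user of the critical unit.
  for (std::size_t I = 0; I < Region.size(); ++I) {
    if (Region[I].Scheduled || !Region[I].Ready)
      continue;
    for (std::size_t S : Region[I].Successors)
      if (!Region[S].Scheduled && Region[S].Unit == Critical)
        return I;
  }
  return NoCandidate;
}
```

With two pending instructions on one unit that are both blocked, and a ready instruction on another unit that feeds one of them, the fallback selects the feeding instruction.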
We are adding the new heuristics incrementally. In the meantime, the
state of tryCandidateCoexec may not be great, as is the case after this
PR.
74e4694 moved ObjCARCContract from running before the codegen pipeline
into addISelPrepare(), which runs after PreISelIntrinsicLowering.
This broke ObjCARCContract's retainRV-to-claimRV optimization because
ObjCARCContract identifies ARC calls via intrinsics, not their lowered
counterparts.
This patch restores the pre-74e4694 ordering by moving ObjCARCContract
to addISelPasses.
The IntrinsicInst.cpp change looks extraneous but is required here:
ObjCARCContract may now rewrite the bundle operand from retainRV to
claimRV. When PreISelIntrinsicLowering then encounters this new
intrinsic use, lowerObjCCall asserts mayLowerToFunctionCall.
Assisted-by: claude
rdar://137997453
Add uniform and divergent register bank legalization rules for the
amdgcn_perm intrinsic (v_perm_b32). Since this is a VALU-only
instruction, the uniform case maps the destination to UniInVgprB32 and
all source operands to VgprB32.
Add register bank legalization rules for the amdgcn_permlane64 intrinsic
in the new RegBankLegalize framework.
After GISel legalization, permlane64 always operates on S32 — sub-32-bit
types are anyext'd to S32 and types wider than 32 bits are split into
S32 parts by legalizeLaneOp. Add rules for B32 type.
Also enable -new-reg-bank-select in the permlane64 lit test and update
affected check lines.
v_cvt_scalef32_2xpk16_fp6_f32 and v_cvt_scalef32_2xpk16_bf6_f32 are
multipass instructions, so the destination operand must not overlap with
any of the source operands. Apply Constraints = "@earlyclobber $vdst" to
these two instructions.
Fixes: LCCOMPILER-561