10282 Commits

Author SHA1 Message Date
Sameer Sahasrabuddhe
f9adee2f6b
[AMDGPU] asyncmark support for ASYNC_CNT (#185813)
The ASYNC_CNT is used to track the progress of asynchronous copies
between global and LDS memories. By including it in asyncmark, the
compiler can now assist the programmer in generating waits for
ASYNC_CNT.

Assisted-By: Claude Sonnet 4.5

This is part of a stack:

- #185813
- #185810 

Fixes: LCOMPILER-332
2026-04-07 07:23:09 +05:30
Joe Nash
af95b0a615
[AMDGPU] Remove implicit super-reg defs on mov64 pseudos (#190379)
The mov64 pseudo is split into two 32-bit movs, but those 32-bit movs
still had the full 64-bit register implicitly defined. Removing these
defs affects VOPD formation, so we can emit more VOPD instructions.
2026-04-06 21:11:06 +00:00
Chinmay Deshpande
9033e872fd
[AMDGPU][GISel] RegBankLegalize rules for update_dpp (#190662) 2026-04-06 13:52:10 -07:00
Chinmay Deshpande
40d5a7d69e
[AMDGPU][UniformityAnalysis] Mark set_inactive and set_inactive_chain_arg as SourceOfDivergence (#190640)
`set_inactive` produces a result that varies per-lane based on the EXEC mask, even when both inputs are uniform.
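
A minimal scalar sketch of why the result is divergent, assuming an 8-lane wave for brevity. The names `setInactiveModel` and `WaveSize` are hypothetical and only model the per-lane select; they are not LLVM APIs.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical tiny wave size, chosen only to keep the illustration short.
constexpr std::size_t WaveSize = 8;

// Model of set_inactive: active lanes (EXEC bit set) receive Value,
// inactive lanes receive Inactive. Both inputs are uniform scalars here.
inline std::array<std::uint32_t, WaveSize>
setInactiveModel(std::uint32_t Value, std::uint32_t Inactive,
                 std::uint32_t ExecMask) {
  std::array<std::uint32_t, WaveSize> Result{};
  for (std::size_t Lane = 0; Lane < WaveSize; ++Lane)
    Result[Lane] = ((ExecMask >> Lane) & 1u) ? Value : Inactive;
  return Result;
}
```

With `ExecMask = 0x0f`, lanes 0 to 3 hold `Value` and lanes 4 to 7 hold `Inactive`, so the result differs across lanes even though both inputs are uniform.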
2026-04-06 12:40:22 -07:00
Chinmay Deshpande
12e957fd7f
[AMDGPU][GISel] RegBankLegalize rules for amdgcn_inverse_ballot (#190629) 2026-04-06 10:30:35 -07:00
vangthao95
eb065bf028
AMDGPU/GlobalISel: RegBankLegalize rules for G_EXTRACT_VECTOR_ELT (#189144) 2026-04-06 10:22:11 -07:00
Wooseok Lee
0bef4c7aab
[AMDGPU] Add v2i32 and/or patterns for VOP3 AND_OR and OR3 operations (#188375)
Add ThreeOp_v2i32_Pats pattern class to support v2i32 vector operations
for AND_OR_B32 and OR3_B32 instructions. The new patterns check the
v2i32 and-or or or-or instruction sequence, extract individual 32-bit
elements from v2i32 operands, and apply the and_or or or3 vop3
operations.
2026-04-06 16:54:21 +00:00
Domenic Nutile
5b33f85a08
[AMDGPU] Change isSingleLaneExecution to account for WWM enabling lanes even if there's only one workitem (#188316)
This issue was discovered during some downstream work around Vulkan CTS
tests, specifically
`dEQP-VK.subgroups.arithmetic.compute.subgroupadd_float`

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2026-04-06 12:51:46 -04:00
Stanislav Mekhanoshin
e0908cd7a5
[AMDGPU] Specialize gfx1250 codegen tests for fake and real t16. NFC. (#190390)
This is preparation for turning on real true16, so we can easily
apply or revert it.
2026-04-04 01:55:18 -07:00
vangthao95
df1e67b379
AMDGPU/GlobalISel: RegBankLegalize rules for s_memtime, s_get_waveid (#190268) 2026-04-03 09:46:56 -07:00
Lakreite
a44c15874d
[AMDGPU][CodeGen] Implement SimplifyDemandedBitsForTargetNode for readfirstlane. (#190009)
Propagate demanded bits through readfirstlane intrinsic in
AMDGPUISelLowering with SimplifyDemandedBitsForTargetNode
implementation.

This allows upstream zero/sign extensions to be eliminated when only a
subset of bits is used after the intrinsic.

Partially addresses #128390.
2026-04-03 14:30:47 +02:00
michaelselehov
df48719df3
[AMDGPU] Add !noalias metadata to mem-accessing calls w/o pointer args (#188949)
addAliasScopeMetadata in AMDGPULowerKernelArguments skips instructions
with empty PtrArgs, including memory-accessing calls that have no
pointer arguments (e.g. builtins like threadIdx()). Because these calls
never receive !noalias metadata, ScopedNoAliasAA cannot prove they don't
alias noalias kernel arguments. MemorySSA then conservatively reports
them as clobbers, which prevents AMDGPUAnnotateUniformValues from
marking loads as noclobber, blocking scalarization (s_load) and forcing
expensive vector loads (global_load) instead.

Fix by adding all noalias kernel argument scopes to !noalias metadata
for memory-accessing instructions with no pointer arguments. Since such
instructions cannot access memory through any kernel pointer argument,
all noalias scopes are safe to apply.

This fixes a performance regression in rocFFT introduced by bd9668df0f00
("[AMDGPU] Propagate alias information in AMDGPULowerKernelArguments").

Assisted-by: Claude Opus
2026-04-03 08:41:05 +02:00
Stanislav Mekhanoshin
7084f18f27
[AMDGPU] Fix i16/i8 flat store in true16 with sramecc (#190238)
The pattern was guarded by the D16PreservesUnusedBits predicate
which is not needed for stores.
2026-04-02 17:32:50 -07:00
Simon Pilgrim
8991ce9cff
[AMDGPU] Add basic clmul test coverage (#190205) 2026-04-02 16:41:34 +00:00
zGoldthorpe
e9a62c7698
[DAG] computeKnownFPClass: handle ISD::FABS (#190069)
Use `KnownFPClass::fabs` to handle `ISD::FABS`.

This case will help with updating #188356 to use `computeKnownFPClass`.
2026-04-02 14:48:54 +00:00
Petar Avramovic
5226289b8e
Revert "AMDGPU: Codegen for v_dual_dot2acc_f32_f16/bf16 from VOP3" (#190159)
This reverts commit 47f6a19181b426baa03182ab6a7a41e16b35301d.
Breaks MIOpen; no proper fix is available yet.
2026-04-02 14:05:08 +00:00
Gabriel Baraldi
5e0a06b34d
Move ExpandMemCmp and MergeIcmp to the middle end (#77370)
Moving these into the middle-end pipeline will allow for additional
optimization of the expansion result, such as CSE of redundant loads
(c.f. https://godbolt.org/z/bEna4Md9r). For now, we conservatively place
the passes at the end of the middle-end pipeline, so we mostly don't
benefit from additional optimizations yet. The pipeline position will be
moved in a future change.

This builds on work done by legrosbuffle in
https://reviews.llvm.org/D60318.

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 09:57:00 +02:00
zGoldthorpe
9a354fc5a1
[SelectionDAG] Use KnownBits to determine if an operand may be NaN. (#188606)
Given a bitcast into a fp type, use the known bits of the operand to
infer whether the resulting value can never be NaN.
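
A simplified sketch of the idea for a 32-bit integer bitcast to f32. `KnownBits32` and `bitcastNeverNaN` are hypothetical stand-ins, not LLVM's `KnownBits` class or the code in this patch. An f32 is NaN only when all eight exponent bits are one and the mantissa is non-zero, so either a known-zero exponent bit or an all-known-zero mantissa rules NaN out.

```cpp
#include <cstdint>

// Hypothetical stand-in for known-bits information on a 32-bit value.
struct KnownBits32 {
  std::uint32_t Zero; // bits known to be 0
  std::uint32_t One;  // bits known to be 1
};

constexpr std::uint32_t ExpMask = 0x7f800000u;  // f32 exponent field
constexpr std::uint32_t MantMask = 0x007fffffu; // f32 mantissa field

// True if the value, reinterpreted as f32, can never be NaN:
// some exponent bit is known zero, or every mantissa bit is known zero.
constexpr bool bitcastNeverNaN(KnownBits32 K) {
  return (K.Zero & ExpMask) != 0 || (K.Zero & MantMask) == MantMask;
}
```

For example, a value zero-extended from 16 bits has `Zero = 0xffff0000`, which covers the whole exponent field, so the bitcast result cannot be NaN.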
2026-04-01 22:47:01 -06:00
Stanislav Mekhanoshin
a9df7c7186
[AMDGPU] True16 support for bf16 clamp pattern on gfx1250 (#190036) 2026-04-01 14:26:42 -07:00
Mirko Brkušanin
5d9eb0c76a
[AMDGPU] Define new targets gfx1171 and gfx1172 (#187735) 2026-04-01 18:16:11 +02:00
LU-JOHN
c245d764b8
[CodeGen] Do not remove IMPLICIT_DEF unless all uses have undef flag added (#188133)
Do not remove IMPLICIT_DEF of a physreg unless all uses have an undef
flag added. Previously, only the first use instruction had undef flags
added. This will cause a failure in machine instruction verification.
Multi-instruction uses tested in AMDGPU/multi-use-implicit-def.mir and
X86/multi-use-implicit-def.mir.

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2026-04-01 10:11:42 -05:00
Sergio Afonso
2cff995e91
[AMDGPU] Fix crash with dead frame indices in debug values (#183297)
When spill slots are eliminated (VGPR-to-AGPR, SGPR-to-VGPR lanes),
debug values referencing these frame indices were not always properly
cleaned up. This caused an assertion failure in getObjectOffset() when
PrologEpilogInserter tried to access the offset of a dead frame object.

The existing debug fixup code in SIFrameLowering and SILowerSGPRSpills
had two limitations:
1. It only checked one operand position, but DBG_VALUE_LIST instructions
can have multiple debug operands with frame indices.
2. It didn't handle all types of dead frame indices uniformly.

Fix by centralizing debug info cleanup in removeDeadFrameIndices(),
which already knows all frame indices being removed. This iterates over
all debug operands using MI.debug_operands().

Assisted-by: Claude Code.
2026-04-01 13:41:53 +01:00
Manuel Carrasco
ab4b689258
[AMDGPU][SIFoldOperands] Fix OR -1 fold (#189655)
In SIFoldOperands, folding `or x, -1` to `v_mov_b32 -1` removed
`Src1Idx`, which is incorrect because `-1` is in `Src0Idx` (after
canonicalization).

Closes https://github.com/llvm/llvm-project/issues/189677.
2026-04-01 13:37:37 +01:00
Daniil Fukalov
6bf794a02a
[AMDGPU] Disable generic DAG combines at -O0 to preserve debuggability. (#176304)
Disable generic DAG combines for AMDGPU at -O0 via
disableGenericCombines() to preserve instructions that users may want to
set breakpoints on during debugging.

Assisted-by: Cursor / Claude Opus 4.6
2026-04-01 11:55:17 +02:00
ambergorzynski
67d4842910
[NFC][AMDGPU] New test for untested case in SILowerSGPRSpills (#189426)
[This
case](f380a878d5/llvm/lib/Target/AMDGPU/SILowerSGPRSpills.cpp (L343-L345))
is not covered by any existing tests (checked using code coverage and by
inserting an `abort` at that line). I propose a new test that tests this
line.

This is demonstrated by showing that it is the only test that fails in
the presence of the `abort`.
2026-03-31 13:12:29 +01:00
Luke Lau
598f3535fa
[SelectionDAG] Expand CTTZ_ELTS[_ZERO_POISON] and handle legalization (#188691)
This is a second attempt at "[SelectionDAG] Expand
CTTZ_ELTS[_ZERO_POISON] and handle splitting" (#188220)

That PR had to be reverted in 7d39664a6ae8daaf186b65578492244d96a50bf2
because we had crashes on AMDGPU since we didn't have scalarization
support, and other crashes on PowerPC because we didn't handle the case
when a vector needed to be widened. Tests for these are added in
AMDGPU/cttz-elts.ll, RISCV/rvv/cttz-elts-scalarize.ll and
PowerPC/cttz-elts.ll.

The former crash has been fixed by adding
DAGTypeLegalizer::ScalarizeVecOp_CTTZ_ELTS.

The second crash has been fixed by reworking
TargetLowering::expandCttzElts. The expansion for CTTZ_ELTS is nearly
identical to VECTOR_FIND_LAST_ACTIVE, except it uses a reverse step
vector and subtracts the result from VF. The easiest way to fix these
crashes without introducing regressions is to reuse the
VECTOR_FIND_LAST_ACTIVE expansion which already handles the case where
the vector needs to be widened.

This means that the node now needs to take in a boolean vector argument
and uses VSELECT instead of an AND to zero out inactive lanes, so the op
promotion code has also been shared.
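
The reverse-step-vector formulation described above can be sketched as a scalar model. `cttzEltsModel` is a hypothetical name, and the loop only illustrates the arithmetic, not the actual SelectionDAG node sequence.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Scalar model of the expansion: count the inactive elements that precede
// the first active element; returns VF when no element is active.
inline std::size_t cttzEltsModel(const std::vector<bool> &Mask) {
  const std::size_t VF = Mask.size();
  std::size_t Max = 0;
  for (std::size_t I = 0; I < VF; ++I) {
    const std::size_t Step = VF - I;            // reverse step vector: VF, VF-1, ..., 1
    const std::size_t Sel = Mask[I] ? Step : 0; // select zeroes out inactive lanes
    Max = std::max(Max, Sel);                   // max-reduce over all lanes
  }
  return VF - Max; // subtract the reduction result from VF
}
```

The first active lane carries the largest step value, so subtracting the maximum from VF recovers its index. An all-false mask reduces to 0 and yields VF.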
2026-03-31 07:25:57 +00:00
Matt Arsenault
f48425edca
AMDGPU: Match fract pattern with swapped edge case check (#189081)
A fract implementation can equivalently be written as
  r = fmin(x - floor(x))
  r = isnan(x) ? x : r;
  r = isinf(x) ? 0.0 : r;

or:
  r = fmin(x - floor(x));
  r = isinf(x) ? 0.0 : r;
  r = isnan(x) ? x : r;

Previously this only matched the first form. Match
the case where the isinf check is the inner clamp. There are
a few more ways to write this pattern (e.g., move the clamp of
infinity to the input) but I haven't encountered that in the wild.

The existing code seems to be trying too hard to match noncanonical
variants of the pattern. Handle only the result that all 4 permutations
of compare and select produce out of instcombine.
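
A small sketch showing that the two orderings agree. The message elides the second `fmin` operand; the clamp value used here (the largest float below 1.0) is an assumption, and `clampedFract`, `fractNanInner` and `fractInfInner` are hypothetical names.

```cpp
#include <cmath>

// Assumption: clamp to the largest float below 1.0. The commit message
// does not show this operand.
inline float clampedFract(float X) {
  const float BelowOne = std::nextafter(1.0f, 0.0f);
  return std::fmin(X - std::floor(X), BelowOne);
}

// First form: the NaN check is applied first, the infinity check last.
inline float fractNanInner(float X) {
  float R = clampedFract(X);
  R = std::isnan(X) ? X : R;
  R = std::isinf(X) ? 0.0f : R;
  return R;
}

// Second form: the infinity check is applied first, the NaN check last.
inline float fractInfInner(float X) {
  float R = clampedFract(X);
  R = std::isinf(X) ? 0.0f : R;
  R = std::isnan(X) ? X : R;
  return R;
}
```

Because an input cannot be both NaN and infinite, at most one of the two selects fires, which is why the order does not change the result.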
2026-03-31 09:13:58 +02:00
vangthao95
b85492b3d3
AMDGPU/GlobalISel: RegBankLegalize rules for sudot4/sudot8 (#189104) 2026-03-30 16:23:25 -07:00
Simon Pilgrim
d74f098a30
[DAG] isKnownNeverNaN - fallback to computeKnownFPClass check (#189476)
Remove ConstantFPSDNode handling from isKnownNeverNaN and fall back to
using computeKnownFPClass if there are no opcode matches in
isKnownNeverNaN.

The test check changes are due to isKnownNeverNaN not handling
UNDEF/POISON, whereas computeKnownFPClass does (POISON in particular now
returns isKnownNeverNaN == true, preventing an ISD::FCANONICALIZE call in
expandFMINNUM_FMAXNUM).
2026-03-30 21:49:15 +00:00
Stanislav Mekhanoshin
5f99854d01
[AMDGPU] Drop A and B neg modifier from amdgcn_wmma_bf16_16x16x32_bf16 (#189468)
Fixes: LCOMPILER-1673
2026-03-30 14:14:22 -07:00
Alexey Merzlyakov
06725d7ef5
[GISel] Keep non-negative info in SUB(CTLZ) (#189314)
Implement non-negative value tracking for SUB-CTLZ chains in GlobalISel,
matching the behavior previously added to SelectionDAG.
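
A rough illustration of why such a chain is non-negative, assuming it has the shape `BitWidth - ctlz(x)`. The exact patterns the patch recognizes are not stated here, and `ctlz32` and `subCtlz` are hypothetical helpers.

```cpp
#include <cstdint>

// Count leading zeros of a 32-bit value; returns 32 for zero.
inline int ctlz32(std::uint32_t X) {
  int N = 0;
  for (std::uint32_t Bit = 0x80000000u; Bit != 0 && (X & Bit) == 0; Bit >>= 1)
    ++N;
  return N;
}

// ctlz32 is always in [0, 32], so 32 - ctlz32(X) is always in [0, 32]
// and its sign bit is known to be zero.
inline int subCtlz(std::uint32_t X) { return 32 - ctlz32(X); }
```

Knowing the sign bit is zero is what lets later combines treat the result as non-negative.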

Additionally, refactor the SelectionDAG implementation from the previous
patch to improve performance and code density.

Related to https://github.com/llvm/llvm-project/issues/136516 and
https://github.com/llvm/llvm-project/pull/186338#discussion_r2980420174
2026-03-30 22:10:47 +02:00
Jeffrey Byrnes
7364203924
Reapply "[AMDGPU] Add HWUI pressure heuristics to coexec strategy (#184929)" (#189121)
Reland https://github.com/llvm/llvm-project/pull/184929 after fixing
some issues in the NDEBUG builds.

3a640ee is unchanged from the previously approved PR; the unreviewed
portion of this PR is 9cabd8d.
2026-03-30 12:18:29 -07:00
vangthao95
ec6574e90e
AMDGPU/GlobalISel: RegBankLegalize rules for udot2/sdot2 (#189103) 2026-03-30 10:43:05 -07:00
vangthao95
35a1961287
AMDGPU/GlobalISel: RegBankLegalize rules for dot products (#189110) 2026-03-30 10:15:12 -07:00
vangthao95
2f0118895b
AMDGPU/GlobalISel: RegBankLegalize rules for ds_append/ds_consume (#189143) 2026-03-30 09:57:57 -07:00
vangthao95
c32d670757
AMDGPU/GlobalISel: RegBankLegalize rules for ds_ordered_add/swap (#189137) 2026-03-30 09:57:04 -07:00
vangthao95
27e3c43d74
AMDGPU/GlobalISel: RegBankLegalize rules for global_load_lds (#189135) 2026-03-30 09:53:12 -07:00
vangthao95
f4d1745ab3
AMDGPU/GlobalISel: RegBankLegalize rules for lds_direct_load (#189134) 2026-03-30 09:52:34 -07:00
Mariusz Sikora
6caec7ecdb
[AMDGPU] Add tanh tests for gfx13 (#188240) 2026-03-30 14:30:04 +02:00
Matt Arsenault
c67475fb05
AMDGPU: Avoid using -march in tests (#189285) 2026-03-29 21:31:59 +00:00
Matt Arsenault
2c41a8de9a
AMDGPU: Fix using -march in a couple tests (#189271) 2026-03-29 18:42:28 +00:00
Folkert de Vries
73cddef788
optimize is_finite assembly (#169402)
Fixes https://github.com/llvm/llvm-project/issues/169270

Changes the implementation of `is_finite` to emit fewer instructions,
e.g.

X86_64

```asm
old: # 18 bytes
        movd    %xmm0, %eax
        andl    $2147483647, %eax
        cmpl    $2139095040, %eax
        setl    %al
        retq
new: # 15 bytes
        movd    %xmm0, %eax
        addl    %eax, %eax
        cmpl    $-16777216, %eax
        setb    %al
        retq
```

AArch64

```asm
old:
        fmov    w9, s0
        mov     w8, #2139095040
        and     w9, w9, #0x7fffffff
        cmp     w9, w8
        cset    w0, lt
        ret
new:
        fmov    w8, s0
        ubfx    w8, w8, #23, #8
        cmp     w8, #255
        cset    w0, lo
        ret
```

See the issue for more information.
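
For readers who prefer source over assembly, the three checks above can be modeled on the raw f32 bit pattern. This is a sketch of the bit tricks only, with hypothetical function names; it is not the code from the patch.

```cpp
#include <cstdint>
#include <cstring>

// Reinterpret a float as its 32-bit pattern, and back.
inline std::uint32_t bitsOf(float X) {
  std::uint32_t B;
  std::memcpy(&B, &X, sizeof B);
  return B;
}
inline float fromBits(std::uint32_t B) {
  float X;
  std::memcpy(&X, &B, sizeof X);
  return X;
}

// Old form: clear the sign bit, then compare against the pattern of +inf.
inline bool isFiniteMask(float X) {
  return (bitsOf(X) & 0x7fffffffu) < 0x7f800000u;
}

// New x86-64 form: doubling shifts the sign bit out, leaving the exponent
// in the top 8 bits for a single unsigned compare.
inline bool isFiniteShift(float X) {
  const std::uint32_t B = bitsOf(X);
  return (B + B) < 0xff000000u;
}

// New AArch64 form: extract the 8-bit exponent; finite iff it is not all ones.
inline bool isFiniteExponent(float X) {
  return ((bitsOf(X) >> 23) & 0xffu) < 255u;
}
```

All three return false exactly when the exponent field is all ones, which covers both infinities and every NaN.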
2026-03-29 14:28:07 +00:00
Ruiling, Song
c6fa976d5b
AMDGPU: Make VarIndex WeakTrackingVH in AMDGPUPromoteAlloca (#188921)
The test checks used to look fine, but the result was actually wrong. A
WeakVH just nulls itself when the value it points to is replaced, so
VarIndex became null and a zero index was used.

Only WeakTrackingVH can be updated to the new value.

Change the test slightly so that using a zero index is visibly wrong.
2026-03-28 09:50:25 +08:00
Matt Arsenault
9be0cc173d
AMDGPU: Skip last corrections and scaling for afn llvm.sqrt.f64 (#183697)
Device libs has a fast sqrt macro implemented this way.
2026-03-27 23:59:25 +00:00
Stanislav Mekhanoshin
a2d84b5d8d
[AMDGPU] Remove neg support from 4 more gfx1250 WMMA (#189115)
These were previously covered by AMDGPUWmmaIntrinsicModsAllReuse.
2026-03-27 15:20:14 -07:00
Matt Arsenault
e825f42427
AMDGPU: Improve fsqrt f64 expansion with ninf (#183695) 2026-03-27 22:25:32 +01:00
Jeffrey Byrnes
1c3018b3d6
Revert "[AMDGPU] Add HWUI pressure heuristics to coexec strategy" (#189107)
Seems to be triggering some issues with the buildbots

https://lab.llvm.org/buildbot/#/builders/159/builds/44122

Unused variable + bad debug build.
2026-03-27 13:48:49 -07:00
Jeffrey Byrnes
a9f5f93440
[AMDGPU] Add HWUI pressure heuristics to coexec strategy (#184929)
Adds basic support for new heuristics for the CoExecSchedStrategy.

InstructionFlavor provides a way to map instructions to different
"Flavors". These "Flavors" all have special scheduling considerations --
either they map to different HardwareUnits, or have unique scheduling
properties like fences.

HardwareUnitInfo provides a way to track and analyze the usage of some
hardware resource across the current scheduling region.

CandidateHeuristics holds the state for new heuristics, as well as the
implementations.

In addition, this adds new heuristics to use the various support pieces
listed above. tryCriticalResource attempts to schedule instructions that
use the most demanded HardwareUnit. If no such instructions are ready to
be scheduled, tryCriticalResourceDependency attempts to schedule
instructions which enable instructions that use demanded HardwareUnits.

We are incrementally adding the new heuristics. While in the process of
this, the state of tryCandidateCoexec may not be great - as is the case
after this PR.
2026-03-27 13:34:03 -07:00
Matt Arsenault
28f24b5029
AMDGPU: Add baseline tests for more fract patterns (#189092) 2026-03-27 19:38:54 +00:00
Kewen Meng
a996f2a8db
Revert "AMDGPU: Fold frame indexes into disjoint s_or_b32" (#189074)
Reverts llvm/llvm-project#102345

unblock bot: https://lab.llvm.org/buildbot/#/builders/10/builds/25403
2026-03-27 18:33:01 +00:00