9730 Commits

Author SHA1 Message Date
Matt Arsenault
0db4393762
AMDGPU: Add baseline tests for f64 rsq pattern handling (#172052) 2025-12-19 10:12:25 +01:00
vangthao95
031e9c989e
[AMDGPU][GlobalISel] Add RegBankLegalize support for G_FPTRUNC (#171723) 2025-12-18 13:16:46 -08:00
vangthao95
55089733b6
[AMDGPU][GlobalISel] Add readanylane combines for merge-like instructions (#172546)

When a merge-like instruction has all readanylane sources and the result
is copied to VGPRs, eliminate the readanylanes by either using the
original unmerge source directly or building a new merge with the VGPR
sources.
2025-12-18 08:04:06 -08:00
macurtis-amd
e741cd88a1
AMDGPU/PromoteAlloca: Fix handling of users of multiple allocas (#172771)
With recent refactoring, LDS promotion worklists for all allocas are
populated upfront. In some cases, this results in a User in multiple
lists. Then as each list is processed, a User might get deleted via
removeFromParent, potentially leaving a dangling pointer in a subsequent
worklist.

Currently this only occurs for memcpy and memmove. Prior to refactoring,
these were handled by DeferredInstr, and were processed after the last
use of the then singular worklist.

This change moves processing of DeferredInstr to after all worklists
have been processed.
2025-12-18 08:41:21 -06:00
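The ordering fix described in #172771 can be pictured with a small standalone model. This is only a sketch under assumptions: `Inst`, `IsMemTransfer`, and `promoteAll` are hypothetical names and do not come from the pass. It shows deferred users being collected once, then erased only after every worklist has been walked.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-in for an IR user; not an LLVM type.
struct Inst {
  int Id;
  bool IsMemTransfer; // models memcpy/memmove, the deferred kind
  bool Erased = false;
};

// Walk every worklist first, then handle the deferred users exactly once.
// Erasing inside the first loop would leave later lists pointing at dead users.
unsigned promoteAll(const std::vector<std::vector<Inst *>> &Worklists) {
  std::set<Inst *> Deferred; // a user may appear in several worklists
  for (const auto &List : Worklists) {
    for (Inst *I : List) {
      assert(!I->Erased && "worklist holds an already erased user");
      if (I->IsMemTransfer)
        Deferred.insert(I);
      // Non-deferred users would be rewritten in place here.
    }
  }
  for (Inst *I : Deferred)
    I->Erased = true; // stands in for the rewrite plus removal
  return static_cast<unsigned>(Deferred.size());
}
```

With a memcpy-like user shared by two worklists, the assertion never fires because nothing is erased until both lists have been visited.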
Frederik Harwath
5c05824d2b
[CodeGen] Rename expand-fp to expand-ir-insts (#172681)
The pass now contains a non-fp expansion and should
be used for any similar expansions regardless of the
types involved. Hence a generic name seems apt.

Rename the source files, pass, and adjust the pass
description. Move all tests for the expansions
that have previously been merged into the pass
to a single directory.
2025-12-18 11:15:04 +00:00
Matt Arsenault
d6f159dd05
AMDGPU: Add pattern for copysign of 0 (#172699)
Avoiding v_bfi_b32 is desirable since on gfx9 it
requires materializing the constant.

Something similar could be done for infinity, with an or of 0x7fffffff.
2025-12-18 11:34:24 +01:00
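As a rough illustration of why a copysign with a zero magnitude needs no bitfield insert, the sketch below works on the IEEE-754 binary32 bit pattern directly. The helper names are made up for this example, and it says nothing about which instruction the backend actually selects.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a float as its IEEE-754 binary32 bit pattern.
uint32_t bitsOf(float V) {
  static_assert(sizeof(float) == sizeof(uint32_t), "expects a 32-bit float");
  uint32_t Bits;
  std::memcpy(&Bits, &V, sizeof(Bits));
  return Bits;
}

// copysign(+0.0, Sign) as a single AND: a zero magnitude contributes no
// exponent or mantissa bits, so only the sign bit of Sign survives.
uint32_t copysignZeroBits(float Sign) {
  return bitsOf(Sign) & 0x80000000u;
}

// Cross-check the bit trick against the standard library.
void checkAgainstStdCopysign(float Sign) {
  assert(copysignZeroBits(Sign) == bitsOf(std::copysign(0.0f, Sign)));
}
```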
Frederik Harwath
71760f324f
[CodeGen] Merge ExpandLargeDivRem into ExpandFp (#172680)
Both passes expand instructions at the IR level.
They use the same kind of instruction visitation
logic and contain significant code duplication e.g.
for scalarization.
2025-12-18 09:22:47 +01:00
Matt Arsenault
399b33086f
AMDGPU: Add baseline tests for fcopysign with 0 magnitude (#172698) 2025-12-17 20:22:52 +01:00
Pankaj Dwivedi
28d4e33b65
[AMDGPU][SIInsertWaitCnt] Optimize loadcnt insertion at function boundaries (#169647)
On GFX12+, GLOBAL_INV increments the loadcnt counter but does not write
results to any VGPRs. Previously, we unconditionally inserted
s_wait_loadcnt 0 at function returns even when the only pending loadcnt
was from GLOBAL_INV instructions.

This patch optimizes waitcnt insertion by skipping the loadcnt wait at
function boundaries when no VGPRs have pending loads. This is determined
by checking if any VGPR has a score greater than the lower bound for
LOAD_CNT - if not, the pending loadcnt must be from non-VGPR-writing
instructions like GLOBAL_INV.

The optimization is limited to GFX12+ targets where GLOBAL_INV exists
and uses the extended wait count instructions.

This is a follow-up optimization to PR #135340 which added tracking for
GLOBAL_INV in the waitcnt pass.
2025-12-17 17:53:00 +05:30
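The check described in #169647 can be sketched as follows. `LoadCntState` and both functions are hypothetical simplifications written for this note, not the data structures used by the waitcnt pass. The model assumes each VGPR remembers the score of the last load that writes it, and that scores at or below the lower bound have already completed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of the LOAD_CNT bookkeeping.
struct LoadCntState {
  uint32_t LowerBound = 0;          // scores <= LowerBound are complete
  uint32_t UpperBound = 0;          // score of the newest loadcnt event
  std::vector<uint32_t> VgprScores; // last pending-load score per VGPR
};

// True if some VGPR is still the destination of an outstanding load.
bool anyVgprPending(const LoadCntState &S) {
  return std::any_of(S.VgprScores.begin(), S.VgprScores.end(),
                     [&](uint32_t Score) { return Score > S.LowerBound; });
}

// At a function return, wait only if a VGPR result is still in flight.
// Events that bump the counter without writing a VGPR (GLOBAL_INV in the
// commit message) leave every VGPR score at or below the lower bound.
bool needsLoadCntWaitAtReturn(const LoadCntState &S) {
  assert(S.LowerBound <= S.UpperBound && "score window is inverted");
  bool CounterOutstanding = S.UpperBound > S.LowerBound;
  return CounterOutstanding && anyVgprPending(S);
}
```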
Matt Arsenault
68aea8e202
AMDGPU: Avoid introducing unnecessary fabs in fast fdiv lowering (#172553)
If the sign bit of the denominator is known 0, do not emit the fabs.
Also, extend this to handle min/max with fabs inputs.

I originally tried to do this as the general combine on fabs, but
it proved to be too much trouble at this time. This is mostly
complexity introduced by expanding the various min/maxes into
canonicalizes, and then not being able to assume the sign bit
of canonicalize (fabs x) without nnan.

This defends against future code size regressions in the atan2 and
atan2pi library functions.
2025-12-17 00:22:12 +01:00
Matt Arsenault
b971b510d6
AMDGPU: Add baseline test for redundant fabs on fdiv expansion (#172552) 2025-12-16 23:26:55 +01:00
Matt Arsenault
eb1876c960
DAG: Fix arith_fence handling in SignBitIsZeroFP (#172537) 2025-12-16 20:10:38 +00:00
Frederik Harwath
51cdebf339
[AMDGPU] SIOptimizeExecMaskingPreRA: Fix crash on exec copy fold into INLINEASM (#172481)
The optimization crashed attempting to fix a fold of a COPY $exec
instruction into a use in an INLINEASM instruction because it attempts
to call isOperandLegal which crashes since the index is out of the
MCInstrDesc's operands array bounds.

Change SIOptimizeExecMaskingPreRA to skip the optimization if the
operand index is out of bounds.

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-12-16 17:55:05 +01:00
vangthao95
a3b3c027bb
[AMDGPU][NFC] Pre-commit tests for readanylane combines (#172398) 2025-12-16 08:25:29 -08:00
Dark Steve
7d381f2a56
[AMDGPU] Schedule independent instructions between s_barrier_signal and s_barrier_wait (#172057)
On gfx12+, the unified `s_barrier` is lowered to split
`s_barrier_signal/s_barrier_wait` pairs. By default, the dependency edge
between signal and wait has zero latency, causing the scheduler to emit
them adjacent to each other. This misses the opportunity to hide barrier
latency.

This patch adds synthetic latency to the signal-wait barrier edge to
encourage latency hiding. Independent instructions are scheduled in the
gap between split barrier signal and wait.

The latency is tunable via -amdgpu-barrier-signal-wait-latency.

Fixes: SWDEV-567090
2025-12-16 11:48:50 +05:30
vangthao95
e9b1f56d35
[AMDGPU][GlobalISel] Add RegBankLegalize support for G_BITREVERSE (#172101) 2025-12-15 17:45:33 -08:00
michaelselehov
3645cef1ef
[AMDGPU] LiveRegOptimizer: consider i8/i16 binops on SDWA (#155800)
PHI-node part was merged with PR#160909.

Extend `isOpLegal` to treat 8/16-bit vector add/sub/and/or/xor as
profitable on SDWA targets (stores and intrinsics remain profitable).
This repacks loop-carried values to i32 across BBs and restores SDWA
lowering instead of scattered lshr/lshl/or sequences.

Testing:
- Local: `check-llvm-codegen-amdgpu` is green (4314/4320 passed, 6
XFAIL).
- Additional: validated in AMD internal CI
2025-12-15 12:04:33 -05:00
Petar Avramovic
f024026a21
AMDGPU/GlobalISel: Regbanklegalize for G_CONCAT_VECTORS (#171471)
RegBankLegalize using trivial mapping helper, assigns same reg bank
to all operands, vgpr or sgpr.
Uncovers multiple codegen and regbank combiner regressions related to
looking through sgpr to vgpr copies.
Skip regbankselect-concat-vector.mir since agprs are not yet supported.
2025-12-15 10:37:40 +01:00
Juan Manuel Martinez Caamaño
c13bf9eb26
Reapply "[AMDGPU][SDAG] Add missing cases for SI_INDIRECT_SRC/DST (#170323)" (#171838)
A buildbot failed for the original patch.

https://github.com/llvm/llvm-project/pull/171835 addresses the issue
raised by the buildbot.
After the fix is merged, the original patch is reapplied without any
change.
2025-12-15 09:05:00 +01:00
Craig Topper
0cdc1b6dd4
[SelectionDAG] Support integer types with multiple registers in ComputePHILiveOutRegInfo. (#172081)
PHIs that are larger than a legal integer type are split into multiple
virtual registers that are numbered sequentially. We can propagate the
known bits for each of these registers individually.

Big endian is not supported yet because the register order needs to be
reversed.

Fixes #171671
2025-12-13 13:24:41 -08:00
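The per-register propagation in #172081 amounts to slicing the known bits of a wide value into register-sized pieces. A minimal sketch, assuming a 64-bit value split into two 32-bit registers with the least significant piece first; the struct and function names are invented for illustration and are not the LLVM `KnownBits` API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Known-zero and known-one masks; hypothetical stand-ins for KnownBits.
struct KnownBits64 { uint64_t Zero; uint64_t One; };
struct KnownBits32 { uint32_t Zero; uint32_t One; };

// Slice the masks into two 32-bit parts. Part 0 is the low half, which
// matches sequentially numbered registers on a little-endian target only.
std::vector<KnownBits32> splitKnownBits(KnownBits64 K) {
  assert((K.Zero & K.One) == 0 && "a bit cannot be known zero and known one");
  std::vector<KnownBits32> Parts;
  for (unsigned I = 0; I < 2; ++I) {
    unsigned Shift = 32 * I;
    Parts.push_back({static_cast<uint32_t>(K.Zero >> Shift),
                     static_cast<uint32_t>(K.One >> Shift)});
  }
  return Parts;
}
```

For a zero-extended 32-bit value, the second part comes out with all bits known zero, which is the information a later combine can use on the high register alone.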
Jeffrey Byrnes
e45241a4fe
[AMDGPU] Hoist s_set_vgpr_msb past SALU program state instructions (#172108)
Hoisting past the program state instructions is legal and allows for
better coissue.
2025-12-12 18:04:20 -08:00
Syadus Sefat
f3c16454b4
[Reland][AMDGPU][GlobalISel] Add register bank legalization for buffer_load byte and short (#172065)
This patch adds register bank legalization support for buffer load byte
and short operations in the AMDGPU GlobalISel pipeline.

This is a re-land of #167798. I have fixed the failing test
/CodeGen/AMDGPU/GlobalISel/buffer-load-byte-short.ll
2025-12-12 14:47:08 -06:00
Aiden Grossman
b8816a4e83
Revert "[AMDGPU][GlobalISel] Add register bank legalization for buffer_load byte and short (#167798)"
This reverts commit 4dbd16bb62ca18b0c588e2f387ac5cc94a782efb.

This was causing buildbot failures, including on premerge when running
check-llvm.

https://lab.llvm.org/buildbot/#/builders/185/builds/30323
2025-12-12 17:35:25 +00:00
Matt Arsenault
2af693bbec
AMDGPU: Fix selection failure on bf16 inverse sqrt (#172044)
On !hasBF16TransInsts targets, an illegal rsq would form
and fail to select.
2025-12-12 18:10:08 +01:00
Syadus Sefat
4dbd16bb62
[AMDGPU][GlobalISel] Add register bank legalization for buffer_load byte and short (#167798)
This patch adds register bank legalization support for buffer load byte
and short operations in the AMDGPU GlobalISel pipeline.
2025-12-12 10:35:15 -06:00
Juan Manuel Martinez Caamaño
55c0e2e20f
[AMDGPU] Add missing cases for V_INDIRECT_REG_{READ/WRITE}_GPR_IDX and V/S_INDIRECT_REG_WRITE_MOVREL (#171835)
A buildbot failure in https://github.com/llvm/llvm-project/pull/170323
when expensive checks were used highlighted that some of these patterns
were missing.

This patch adds `V_INDIRECT_REG_{READ/WRITE}_GPR_IDX` and
`V/S_INDIRECT_REG_WRITE_MOVREL` for `V6` and `V7` vector sizes.
2025-12-12 15:45:34 +00:00
Pierre van Houtryve
025d0c0d1d
(reland) [AMDGPU][SIInsertWaitCnts] Use RegUnits-based tracking (#162077) (#171779)
Fixed a crash in Blender due to some weird control flow.
The issue was with the "merge" function which was only looking at the
keys of the "Other" VMem/SGPR maps. It needs to look at the keys of both
maps and merge them.

Original commit message below
----

The pass was already "reinventing" the concept just to deal with 16 bit
registers. Clean up the entire tracking logic to only use register
units.

There are no test changes because functionality didn't change, except:
- We can now track more LDS DMA IDs if we need it (up to `1 << 16`)
- The debug prints also changed a bit because we now talk in terms of
register units.

This also changes the tracking to use a DenseMap instead of a massive
fixed size table. This trades a bit of access speed for a smaller memory
footprint. Allocating and memsetting a huge table to zero caused a
non-negligible performance impact (I've observed up to 50% of the time
in the pass spent in the `memcpy` built-in on a big test file).

I also think we don't access these often enough to really justify using
a vector. We do a few accesses per instruction, but not much more. In a
huge 120MB LL file, I can barely see the trace of the DenseMap accesses.
2025-12-12 09:41:04 +01:00
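The merge fix mentioned in the reland can be illustrated with a sparse map keyed by register unit. This is a simplified model, not the pass's code: `std::unordered_map` stands in for `DenseMap`, and the per-side shift is an assumption made here to show why an entry present on only one side still has to be visited.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>

// Register unit -> pending score. A missing key means "nothing pending".
using ScoreMap = std::unordered_map<unsigned, uint32_t>;

// Look up a score and rebase it by Shift; absent keys count as 0.
uint32_t lookupShifted(const ScoreMap &M, unsigned Key, uint32_t Shift) {
  auto It = M.find(Key);
  return It == M.end() ? 0 : It->second + Shift;
}

// Merge Other into Mine, keeping the larger rebased score per key.
// Both key sets are visited: walking only Other's keys would leave
// entries unique to Mine without their rebase.
void mergeScores(ScoreMap &Mine, uint32_t MyShift, const ScoreMap &Other,
                 uint32_t OtherShift) {
  assert(&Mine != &Other && "merging a map into itself");
  ScoreMap Result;
  for (const auto &KV : Mine)
    Result[KV.first] = std::max(KV.second + MyShift,
                                lookupShifted(Other, KV.first, OtherShift));
  for (const auto &KV : Other)
    if (!Mine.count(KV.first))
      Result[KV.first] = KV.second + OtherShift;
  Mine = std::move(Result);
}
```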
Nicolai Hähnle
e760d0619f
AMDGPU/PromoteAlloca: Refactor into analysis / commit phases (#170512)
This change is motivated by the overall goal of finding alternative ways
to promote allocas to VGPRs. The current solution is effectively limited
to allocas whose size matches a register class, and we can't keep adding
more register classes. We have some downstream work in this direction,
and I'm currently looking at cleaning that up to bring it upstream.

This refactor paves the way to adding a third way of promoting allocas,
on top of the existing alloca-to-vector and alloca-to-LDS. Much of the
analysis can be shared between the different promotion techniques.

Additionally, the idea behind splitting the pass into an analysis
phase and a commit phase is that it ought to allow us to more easily
make better "big picture" decisions about which allocas to promote how
in the future.
2025-12-12 01:24:38 +00:00
vangthao95
854ef8df06
[AMDGPU][GlobalISel] Add RegBankLegalize support for G_FSUB (#171244) 2025-12-11 11:55:28 -08:00
Brox Chen
16c0893f04
[AMDGPU][True16] remove pack32 pattern from true16 mode (#171756)
Remove pack32 so that isel uses reg_sequence in true16 mode for
build_vector. This generates better code.
2025-12-11 09:49:27 -05:00
Stephen Thomas
7c328d8a0a
[AMDGPU][GCNHazardRecognizer] Remove instances of hardcoded S_WAITCNT_DEPCTR operand values (#171811)
Two S_WAITCNT_DEPCTR instructions are constructed with hardcoded operand
values. Replace these with appropriate calls to
AMDGPU::DepCtr::encodeFieldVmVsrc().

NFC, except that the original code was setting reserved operand bits
that should-be-zero, and this is now corrected.
2025-12-11 13:26:54 +00:00
Juan Manuel Martinez Caamaño
c02978867e
Revert "[AMDGPU][SDAG] Add missing cases for SI_INDIRECT_SRC/DST (#170323)" (#171787)
```
Step 7 (test-check-all) failure: Test just built components: check-all completed (failure)
******************** TEST 'LLVM :: CodeGen/AMDGPU/insert_vector_dynelt.ll' FAILED ********************
Exit Code: 1

Command Output (stdout):
--
# RUN: at line 2
/home/buildbot/worker/as-builder-4/ramdisk/expensive-checks/build/bin/llc -mtriple=amdgcn -mcpu=fiji < /home/buildbot/worker/as-builder-4/ramdisk/expensive-checks/llvm-project/llvm/test/CodeGen/AMDGPU/insert_vector_dynelt.ll | /home/buildbot/worker/as-builder-4/ramdisk/expensive-checks/build/bin/FileCheck -enable-var-scope -check-prefixes=GCN /home/buildbot/worker/as-builder-4/ramdisk/expensive-checks/llvm-project/llvm/test/CodeGen/AMDGPU/insert_vector_dynelt.ll
# executed command: /home/buildbot/worker/as-builder-4/ramdisk/expensive-checks/build/bin/llc -mtriple=amdgcn -mcpu=fiji
# executed command: /home/buildbot/worker/as-builder-4/ramdisk/expensive-checks/build/bin/FileCheck -enable-var-scope -check-prefixes=GCN /home/buildbot/worker/as-builder-4/ramdisk/expensive-checks/llvm-project/llvm/test/CodeGen/AMDGPU/insert_vector_dynelt.ll
# RUN: at line 3
/home/buildbot/worker/as-builder-4/ramdisk/expensive-checks/build/bin/llc -O0 -mtriple=amdgcn -mcpu=fiji < /home/buildbot/worker/as-builder-4/ramdisk/expensive-checks/llvm-project/llvm/test/CodeGen/AMDGPU/insert_vector_dynelt.ll | /home/buildbot/worker/as-builder-4/ramdisk/expensive-checks/build/bin/FileCheck --check-prefixes=GCN-O0 /home/buildbot/worker/as-builder-4/ramdisk/expensive-checks/llvm-project/llvm/test/CodeGen/AMDGPU/insert_vector_dynelt.ll
# executed command: /home/buildbot/worker/as-builder-4/ramdisk/expensive-checks/build/bin/llc -O0 -mtriple=amdgcn -mcpu=fiji
# .---command stderr------------
# |
# | # After Instruction Selection
# | # Machine code for function insert_dyn_i32_6: IsSSA, TracksLiveness
# | Function Live Ins: $sgpr16 in %8, $sgpr17 in %9, $sgpr18 in %10, $sgpr19 in %11, $sgpr20 in %12, $sgpr21 in %13, $vgpr0 in %14, $vgpr1 in %15
# |
# | bb.0 (%ir-block.0):
# |   successors: %bb.1(0x80000000); %bb.1(100.00%)
# |   liveins: $sgpr16, $sgpr17, $sgpr18, $sgpr19, $sgpr20, $sgpr21, $vgpr0, $vgpr1
# |   %15:vgpr_32 = COPY $vgpr1
# |   %14:vgpr_32 = COPY $vgpr0
# |   %13:sgpr_32 = COPY $sgpr21
# |   %12:sgpr_32 = COPY $sgpr20
# |   %11:sgpr_32 = COPY $sgpr19
# |   %10:sgpr_32 = COPY $sgpr18
# |   %9:sgpr_32 = COPY $sgpr17
# |   %8:sgpr_32 = COPY $sgpr16
# |   %17:sgpr_192 = REG_SEQUENCE %8:sgpr_32, %subreg.sub0, %9:sgpr_32, %subreg.sub1, %10:sgpr_32, %subreg.sub2, %11:sgpr_32, %subreg.sub3, %12:sgpr_32, %subreg.sub4, %13:sgpr_32, %subreg.sub5
# |   %16:sgpr_192 = COPY %17:sgpr_192
# |   %19:vreg_192 = COPY %17:sgpr_192
# |   %28:sreg_64_xexec = IMPLICIT_DEF
# |   %27:sreg_64_xexec = S_MOV_B64 $exec
# |
# | bb.1:
# | ; predecessors: %bb.1, %bb.0
# |   successors: %bb.1(0x40000000), %bb.3(0x40000000); %bb.1(50.00%), %bb.3(50.00%)
# |
# |   %26:vreg_192 = PHI %19:vreg_192, %bb.0, %18:vreg_192, %bb.1
# |   %29:sreg_64 = PHI %28:sreg_64_xexec, %bb.0, %30:sreg_64, %bb.1
# |   %31:sreg_32_xm0 = V_READFIRSTLANE_B32 %14:vgpr_32, implicit $exec
# |   %32:sreg_64 = V_CMP_EQ_U32_e64 %31:sreg_32_xm0, %14:vgpr_32, implicit $exec
# |   %30:sreg_64 = S_AND_SAVEEXEC_B64 killed %32:sreg_64, implicit-def $exec, implicit-def $scc, implicit $exec
# |   $m0 = COPY killed %31:sreg_32_xm0
# |   %18:vreg_192 = V_INDIRECT_REG_WRITE_MOVREL_B32_V8 %26:vreg_192(tied-def 0), %15:vgpr_32, 3, implicit $m0, implicit $exec
# |   $exec = S_XOR_B64_term $exec, %30:sreg_64, implicit-def $scc
# |   S_CBRANCH_EXECNZ %bb.1, implicit $exec
# |
# | bb.3:
```

This reverts commit 15df9e701f1f1194a25e6123612cc735ad392ae4.
2025-12-11 10:08:20 +00:00
Juan Manuel Martinez Caamaño
15df9e701f
[AMDGPU][SDAG] Add missing cases for SI_INDIRECT_SRC/DST (#170323)
Before this patch, `insertelement/extractelement` with dynamic indices
would fail to select with `-O0` for vector 32-bit element types with
sizes 3, 5, 6 and 7, which did not map to a `SI_INDIRECT_SRC/DST`
pattern.

Other "weird" sizes bigger than 8 (like 13) are properly handled
already.

To solve this issue we add the missing patterns for the problematic
sizes.

Solves SWDEV-568862
2025-12-11 09:17:43 +01:00
Jay Foad
6ae0b9f586
[AMDGPU] Implement codegen for GFX11+ V_CVT_PK_[IU]16_F32 (#168719) 2025-12-10 22:26:59 +00:00
vangthao95
d162afa912
[AMDGPU][GlobalISel] Add RegBankLegalize support for G_FPEXT (#171483) 2025-12-10 08:58:27 -08:00
Nikita Popov
5a24dfa339
[SDAG] Remove most non-canonical libcall handling (#171288)
This is a followup to https://github.com/llvm/llvm-project/pull/171114,
removing the handling for most libcalls that are already canonicalized
to intrinsics in the middle-end. The only remaining one is fabs, which
has more test coverage than the others.
2025-12-10 11:45:26 +01:00
Diana Picus
578a26ada2
[AMDGPU] Relax restrictions on amdgcn.cs.chain intrinsic (#169785)
We have a new use-case for chain functions, so slightly relax the
restriction on which calling conventions may contain calls to chain
functions.
2025-12-10 11:12:46 +01:00
Mirko Brkušanin
5759a3a779
[AMDGPU] Add s_wakeup_barrier instruction for gfx1250 (#170501) 2025-12-10 09:45:13 +01:00
Vikram Hegde
aebab0578b
[NPM] Schedule PhysicalRegisterUsageAnalysis before RegUsageInfoCollectorPass (#168832)
RegUsageInfoCollectorPass requires PhysicalRegisterUsageAnalysis to be valid. This change is required since it's a module analysis.
2025-12-10 11:15:37 +05:30
anjenner
27651133e2
AMDGPU: Drop and upgrade llvm.amdgcn.atomic.csub/cond.sub to atomicrmw (#105553)
These both perform conditional subtraction, returning the minuend and
zero respectively, if the difference is negative.
2025-12-09 23:13:33 +00:00
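The two conditional-subtraction behaviours named in #105553 can be modelled on plain integers. The sketch below is an assumption-laden illustration: the function names are invented, the unsigned comparison is my reading of "if the difference is negative", and the compare-exchange loop only imitates an atomic read-modify-write that returns the old value. It does not state which intrinsic corresponds to which behaviour.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Subtract only when it does not wrap; otherwise keep the minuend.
uint32_t subOrKeep(uint32_t Minuend, uint32_t Subtrahend) {
  return Minuend >= Subtrahend ? Minuend - Subtrahend : Minuend;
}

// Subtract only when it does not wrap; otherwise produce zero.
uint32_t subOrZero(uint32_t Minuend, uint32_t Subtrahend) {
  return Minuend >= Subtrahend ? Minuend - Subtrahend : 0;
}

// Apply NewValue(old, Val) to Mem atomically and return the old value.
template <typename F>
uint32_t atomicUpdate(std::atomic<uint32_t> &Mem, uint32_t Val, F NewValue) {
  uint32_t Old = Mem.load();
  uint32_t New;
  do {
    New = NewValue(Old, Val);
    assert(New <= Old && "a conditional subtraction never grows the value");
  } while (!Mem.compare_exchange_weak(Old, New));
  return Old;
}
```

Starting from 10, subtracting 3 with either rule stores 7. Subtracting 9 from 7 then leaves 7 under the first rule and stores 0 under the second.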
Anshil Gandhi
5052b6ce1d
[AMDGPU] Scavenge a VGPR to eliminate a frame index (#166979)
If the subtarget supports flat scratch SVS mode and there is no SGPR
available to replace a frame index, convert a scratch instruction in SS
form into SV form and replace the frame index with a scavenged VGPR.
Resolves #155902

Co-authored-by: Matt Arsenault <matthew.arsenault@amd.com>
2025-12-09 13:59:36 -05:00
pvanhout
4572f4f5b1
Revert "[AMDGPU][SIInsertWaitCnts] Use RegUnits-based tracking (#162077)"
Fails on https://lab.llvm.org/buildbot/#/builders/123/builds/31922

This reverts commit bf9344099c63549b2f19f8ede29f883669b0baca.
2025-12-09 14:48:19 +01:00
Pierre van Houtryve
bf9344099c
[AMDGPU][SIInsertWaitCnts] Use RegUnits-based tracking (#162077)
The pass was already "reinventing" the concept just to deal with 16 bit
registers. Clean up the entire tracking logic to only use register
units.

There are no test changes because functionality didn't change, except:
- We can now track more LDS DMA IDs if we need it (up to `1 << 16`)
- The debug prints also changed a bit because we now talk in terms of
register units.

This also changes the tracking to use a DenseMap instead of a massive
fixed size table. This trades a bit of access speed for a smaller memory
footprint. Allocating and memsetting a huge table to zero caused a
non-negligible performance impact (I've observed up to 50% of the time
in the pass spent in the `memcpy` built-in on a big test file).

I also think we don't access these often enough to really justify using
a vector. We do a few accesses per instruction, but not much more. In a
huge 120MB LL file, I can barely see the trace of the DenseMap accesses.
2025-12-09 13:51:19 +01:00
Guy David
29611f4cbe
[DAGCombiner] Relax nsz constraint for FP optimizations (#165011)
Some floating-point optimizations don't trigger because they can produce
incorrect results around signed zeros, and rely on the existence of the
nsz flag which commonly appears when fast-math is enabled.
However, this flag is not a hard requirement when all of the users of
the combined value are either guaranteed to overwrite the sign-bit or
simply ignore it (comparisons, etc.).

The optimizations affected:
- fadd x, +0.0 -> x
- fsub x, -0.0 -> x
- fsub +0.0, x -> fneg x
- fdiv(x, sqrt(x)) -> sqrt(x)
- frem lowering with power-of-2 divisors
2025-12-09 12:07:46 +02:00
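The signed-zero hazard behind #165011 is easy to reproduce in standalone code. The sketch below uses the first fold in the list, `fadd x, +0.0 -> x`, and assumes default round-to-nearest and no fast-math flags. The function names are made up for this illustration.

```cpp
#include <cassert>
#include <cmath>

// Does replacing (X + 0.0) by X keep the sign bit? It does not for
// X = -0.0, because -0.0 + +0.0 is +0.0 under round-to-nearest.
bool faddZeroPreservesSign(double X) {
  assert(!std::isnan(X) && "the sign of a NaN is not meaningful here");
  return std::signbit(X + 0.0) == std::signbit(X);
}

// A comparison against zero cannot observe the sign of a zero, so the
// fold is invisible to this kind of user even without nsz.
bool compareIgnoresZeroSign(double X) {
  return ((X + 0.0) == 0.0) == (X == 0.0);
}

// A user that overwrites the sign bit hides the difference as well.
bool fabsIgnoresZeroSign(double X) {
  return std::signbit(std::fabs(X + 0.0)) == std::signbit(std::fabs(X));
}
```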
Vikram Hegde
c590b35f0f
[AMDGPU][NPM] Enable SIModeRegister and SIInsertHardclauses passes (#168831)
Passes already ported.
2025-12-09 14:01:15 +05:30
Matt Arsenault
786498b281
AMDGPU: Fix truncstore from v6f32 to v6f16 (#171212)
The v6bf16 cases work, but that's likely because v6bf16 isn't
currently an MVT.

Fixes: SWDEV-570985
2025-12-08 22:46:36 +00:00
Fei Peng
f803e463f9
Reland "Redesign Straight-Line Strength Reduction (SLSR) (#162930)" (#169614)
This PR implements parts of
https://github.com/llvm/llvm-project/issues/162376

- **Broader equivalence than constant index deltas**:
  - Add Base-delta and Stride-delta matching for Add and GEP forms using
    ScalarEvolution deltas.
  - Reuse enabled for both constant and variable deltas when an available
    IR value dominates the user.
- **Dominance-aware dictionary instead of linear scans**:
  - Tuple-keyed candidate dictionary grouped by basic block.
  - Walk the immediate-dominator chain to find the nearest dominating
    basis quickly and deterministically.
- **Simple cost model and best-rewrite selection**:
  - Score candidate expressions and rewrites; select the highest-profit
    rewrite per instruction.
  - Skip rewriting when expressions are already foldable or
    high-efficiency.
- **Path compression for better ILP**:
  - Compress chains of rewrites to a deeper dominating basis when a
    constant delta exists along the path, reducing dependent bumps on
    critical paths.
- **Dependency-aware rewrite ordering**:
  - Build a dependency graph (basis, stride, variable delta producers) and
    rewrite in topological order.
  - This dependency graph will be needed by the next PR that adds partial
    strength reduction.
- **Correctness enhancement**:
  - Fix a correctness issue where reusing instructions with the same SCEV
    may introduce poison.

---------

Co-authored-by: Kazu Hirata <kazu@google.com>
2025-12-08 16:07:27 -06:00
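The dominance-aware lookup from the SLSR redesign can be sketched with a toy dominator tree. Everything here is hypothetical: `DomTree`, `BlockToBasis`, and the integer block and instruction ids are invented for the example, and the ordering of instructions inside a single block is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <unordered_map>
#include <vector>

// Toy dominator tree: IDom[B] is the immediate dominator of block B.
// The entry block is its own immediate dominator.
struct DomTree {
  std::vector<int> IDom;
};

// Candidates sharing one key, grouped by block: block id -> basis id.
using BlockToBasis = std::unordered_map<int, int>;

// Walk from the user's block up the immediate-dominator chain and return
// the first basis found. Every block on that chain dominates the user, so
// the first hit is the nearest dominating basis.
std::optional<int> nearestDominatingBasis(const DomTree &DT,
                                          const BlockToBasis &Candidates,
                                          int UserBlock) {
  assert(UserBlock >= 0 &&
         static_cast<std::size_t>(UserBlock) < DT.IDom.size());
  for (int B = UserBlock;; B = DT.IDom[B]) {
    auto It = Candidates.find(B);
    if (It != Candidates.end())
      return It->second;
    if (DT.IDom[B] == B)
      return std::nullopt; // reached the entry block without a match
  }
}
```

The walk costs one hash lookup per dominator-tree level, rather than a scan over every earlier candidate.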
Shilei Tian
3ccd67295b
[AMDGPU] Fix a crash when a bool variable is used in inline asm (#171004)
Fixes SWDEV-570184.
2025-12-08 14:44:21 -05:00
Dark Steve
cc19f420b9
[AMDGPU][NPM] Port AMDGPUArgumentUsageInfo to NPM (#170886)
Port AMDGPUArgumentUsageInfo analysis to the NPM to fix suboptimal code
generation when NPM is enabled by default.

Previously, DAG.getPass() returns nullptr when using NPM, causing the
argument usage info to be unavailable during ISel. This resulted in
fallback to FixedABIFunctionInfo which assumes all implicit arguments
are needed, generating unnecessary register setup code for entry
functions.

Fixes LLVM::CodeGen/AMDGPU/cc-entry.ll

Changes:
- Split AMDGPUArgumentUsageInfo into a data class and NPM analysis
wrapper
- Update SIISelLowering to use DAG.getMFAM() for NPM path
- Add RequireAnalysisPass in addPreISel() to ensure analysis
availability

This follows the same pattern used for PhysicalRegisterUsageInfo.
2025-12-08 20:38:00 +05:30
Jay Foad
07bafab83d
[AMDGPU] Do not generate V_FMAC_DX9_ZERO_F32 on GFX12 (#171116)
GFX12 does not have the FMAC form of this instruction, only the FMA
form.

Fixes: #170437
2025-12-08 13:20:02 +00:00