9987 Commits

Author SHA1 Message Date
vangthao95
d354ea6add
AMDGPU/GlobalISel: RegBankLegalize rules for buffer atomic cmpswap (#180666) 2026-02-10 11:11:38 -08:00
Mirko Brkušanin
4280f0d241
[AMDGPU] Add dot4 fp8/bf8 instructions for gfx1170 (#180516) 2026-02-10 12:14:49 +01:00
Anshil Gandhi
bd6dd94584
[AMDGPU] Add legalization rules for atomicrmw max/min ops (#180502)
Adds rules for G_ATOMICRMW_{MAX, MIN, UMAX, UMIN, UINC_WRAP, UDEC_WRAP}.
Each of these generic opcodes is supported for S32 and S64 types
on flat, global and local address spaces.
2026-02-10 16:04:05 +05:30
Matt Arsenault
302ff8fd00
InstCombine: Use SimplifyDemandedFPClass on fmul (#177490)
Start trying to use SimplifyDemandedFPClass on instructions, starting
with fmul. This subsumes the old transform on multiply of 0. The
main change is the introduction of nnan/ninf. I do not think anywhere
was systematically trying to introduce fast math flags before, though
a few odd transforms would set them.

Previously we only called SimplifyDemandedFPClass on function returns
with nofpclass annotations. Start following the pattern of
SimplifyDemandedBits, where this will be called from relevant root
instructions.

I was wondering if this should go into InstCombineAggressive, but that
apparently does not make use of InstCombineInternal's worklist.
2026-02-10 09:49:31 +00:00
Diana Picus
24405f070f
[AMDGPU] Add intrinsic exposing s_alloc_vgpr (#163951)
Make it possible to use `s_alloc_vgpr` at the IR level. This is a huge
footgun and use for anything other than compiler internal purposes is
heavily discouraged. The calling code must make sure that it does not
allocate fewer VGPRs than necessary - the intrinsic is NOT a request to
the backend to limit the number of VGPRs it uses (in essence it's not so
different from what we do with the dynamic VGPR flags of the
`amdgcn.cs.chain` intrinsic, it just makes it possible to use this
functionality in other scenarios).
2026-02-10 09:28:31 +01:00
vangthao95
8d8864237b
AMDGPU/GlobalISel: Regbanklegalize rules for G_FSQRT (#179817)
Add S16 rules for G_FSQRT. S32 and S64 are expanded by the legalizer.
2026-02-09 18:24:28 -08:00
Gheorghe-Teodor Bercea
d1dc843c18
[AMDGPU] Enable sinking of free vector ops that will be folded into their uses (#162580)
Sinking ShuffleVectors / ExtractElement / InsertElement into user blocks
can help enable SDAG combines by providing visibility to the values
instead of emitting CopyTo/FromRegs. The sink IR pass disables sinking
into loops, so this PR extends the CodeGenPrepare target hook
shouldSinkOperands.

Co-authored-by: Jeffrey Byrnes <Jeffrey.Byrnes@amd.com>

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2026-02-09 14:14:31 -05:00
vangthao95
404f9e6c99
AMDGPU/GlobalISel: RegBankLegalize rules for amdgcn_sffbh (#180099)
Change test to use update_llc_test_checks.py and make `v_flbit` test
actually divergent.
2026-02-09 09:18:03 -08:00
vangthao95
0040bdf532
AMDGPU/GlobalISel: Regbanklegalize rules for buffer atomic swap (#180265) 2026-02-09 09:04:17 -08:00
Anshil Gandhi
ab2e10d80f
[AMDGPU] Add legalization rules for G_ATOMICRMW_FADD (#175257)
G_ATOMICRMW_FADD is supported on flat, global and local address spaces
for S32, S64 and V2S16 values.
2026-02-09 15:37:27 +00:00
Shilei Tian
65b4099219
[AMDGPU] Fix instruction size for 64-bit literal constant operands (#180387)
`getLit64Encoding` uses a different approach to determine whether 64-bit
literal encoding is used, which caused a size mismatch between the
`MachineInstr` and the `MCInst`.

For `!isValid32BitLiteral`, it is effectively `!(isInt<32>(Val) ||
isUInt<32>(Val))`, which is `!isInt<32>(Val) && !isUInt<32>(Val)`, but
in `getLit64Encoding`, it is `!isInt<32>(Val) || !isUInt<32>(Val)`.
2026-02-09 14:31:52 +00:00
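The mismatch described in commit 65b4099219 comes down to De Morgan's law. A minimal C++ sketch, using hypothetical stand-ins `fitsInt32` and `fitsUInt32` for `isInt<32>` and `isUInt<32>` (they are not the LLVM helpers themselves), shows values on which the two predicates disagree:

```cpp
#include <cstdint>

// Hypothetical stand-ins for isInt<32> / isUInt<32>; not the LLVM helpers.
constexpr bool fitsInt32(int64_t v) {
  return v >= INT32_MIN && v <= INT32_MAX;
}
constexpr bool fitsUInt32(int64_t v) {
  return v >= 0 && v <= static_cast<int64_t>(UINT32_MAX);
}

// !(isInt<32>(Val) || isUInt<32>(Val)): the value fits neither 32-bit form.
constexpr bool fitsNeither(int64_t v) {
  return !(fitsInt32(v) || fitsUInt32(v));
}

// !isInt<32>(Val) || !isUInt<32>(Val): the value fails at least one form.
constexpr bool failsEither(int64_t v) {
  return !fitsInt32(v) || !fitsUInt32(v);
}

// The predicates disagree for values that fit exactly one 32-bit form.
static_assert(!fitsNeither(-1) && failsEither(-1));
static_assert(!fitsNeither(0xFFFFFFFFLL) && failsEither(0xFFFFFFFFLL));
// They agree when the value fits both forms or neither.
static_assert(!fitsNeither(1) && !failsEither(1));
static_assert(fitsNeither(1LL << 40) && failsEither(1LL << 40));
```

Any value that fits only one of the two 32-bit forms, such as -1 or 0xFFFFFFFF, is where the two size computations would diverge.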
Shilei Tian
392f0c9767
[NFC][AMDGPU] Add a test to show the impact of wrong s_mov_b64 instruction size (#180386) 2026-02-09 08:56:28 -05:00
Mirko Brkušanin
45b037cf7a
[AMDGPU] Add fp8/bf8 conversion instructions for gfx1170 (#180191) 2026-02-09 13:56:43 +01:00
Petr Kurapov
27a8ab09fa
[AMDGPU] Fix V_INDIRECT_REG_READ_GPR_IDX expansion with immediate index (#179699)
The definition for V_INDIRECT_REG_READ_GPR_IDX_B32_V*'s SSrc_b32 operand
allows immediates, but the expansion logic handles only register cases
now. This can result in expansion failures when e.g.
llvm.amdgcn.wave.reduce.umin.i32 is folded into a constant and then used
as an insertelement idx.
2026-02-09 11:33:30 +01:00
Matt Arsenault
2ffb54364f
AMDGPU: Add a test for libcall simplify pow handling (#180491)
This case could be turned into powr or pown, so track which
case ends up preferred.
2026-02-09 10:01:26 +00:00
Pierre van Houtryve
b79ba02479
[AMDGPU][GFX12.5] Reimplement monitor load as an atomic operation (#177343)
Load monitor operations make more sense as atomic operations, as
non-atomic operations cannot be used for inter-thread communication w/o
additional synchronization.
The previous built-in made it work because one could just override the
CPol bits, but that bypasses the memory model and forces the user to learn
about ISA bits encoding.

Making load monitor an atomic operation has a couple of advantages.
First, the memory model foundation for it is stronger. We just lean on the
existing rules for atomic operations. Second, the CPol bits are abstracted away
from the user, which avoids leaking ISA details into the API.

This patch also adds supporting memory model and intrinsics
documentation to AMDGPUUsage.

Solves SWDEV-516398.
2026-02-09 09:57:27 +01:00
Matt Arsenault
8554ed738f
AMDGPU: Add syntax for s_wait_event values (#180272)
Previously this would just print hex values. Print names for the
recognized values, matching the sp3 syntax.
2026-02-09 08:29:55 +00:00
Matt Arsenault
0c583e784e
AMDGPU: Add llvm.amdgcn.s.wait.event intrinsic (#180170)
Exactly match the s_wait_event instruction. For some reason we already
had this instruction used through llvm.amdgcn.s.wait.event.export.ready,
but that hardcodes a specific value. This should really be a bitmask
that can combine multiple wait types.

gfx11 -> gfx12 broke compatibility in a weird way, by inverting the
interpretation of the bit but also shifting the used bit by 1. Simplify
the selection of the old intrinsic by just using the magic number 2,
which should satisfy both cases.
2026-02-09 08:45:13 +01:00
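To see why a single immediate can serve both generations in commit 0c583e784e, here is a rough C++ model. The bit positions and polarities below are assumptions chosen to match the description (inverted meaning, shifted by one bit); they are not taken from the ISA documentation, and the function names are hypothetical:

```cpp
#include <cstdint>

// ASSUMPTION: on gfx11, bit 0 set means "do NOT wait for export-ready",
// so a clear bit 0 requests the wait.
constexpr bool waitsForExportReadyGfx11(uint16_t imm) {
  return (imm & 0x1) == 0;
}

// ASSUMPTION: on gfx12, bit 1 set means "wait for export-ready".
constexpr bool waitsForExportReadyGfx12(uint16_t imm) {
  return (imm & 0x2) != 0;
}

// The "magic number 2" from the commit message: bit 0 clear, bit 1 set.
constexpr uint16_t kLegacyImm = 2;

static_assert(waitsForExportReadyGfx11(kLegacyImm));
static_assert(waitsForExportReadyGfx12(kLegacyImm));
```

Under these assumptions the value 2 requests the wait on both encodings, which is the property the commit relies on.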
paperchalice
c53acf0443
[SelectionDAGBuilder] Remove NoNaNsFPMath uses (#169904)
Replaced by checking fast-math flags or value tracking results.
2026-02-09 09:48:07 +08:00
paperchalice
5c5677d7b8
[llvm] Remove "no-infs-fp-math" attribute support (#180083)
This was one of the global options in `TargetMachine::resetTargetOptions`.
No backend supports it any longer, so remove it.
2026-02-09 08:43:33 +08:00
Alex Wang
a947599991
[AMDGPU][GlobalISel] Add lowering for G_FMODF (#180152)
Add generic expansion for G_FMODF matching the SelectionDAG
implementation.

Enable G_FMODF lowering for AMDGPU with tests.

Related: #179434
2026-02-07 18:43:55 +00:00
Jay Foad
269fda118a
[AMDGPU] Fix pattern selecting fmul to v_fma_mix_f32 (#180210)
This needs to use an addend of -0.0 to get the correct result when the
result should be -0.0.
2026-02-07 09:31:07 +00:00
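The reason for the -0.0 addend in commit 269fda118a can be checked with ordinary IEEE arithmetic. The sketch below uses `std::fma` as a stand-in for the hardware instruction and assumes the default round-to-nearest mode; the function names are illustrative only, and the +0.0 variant is shown purely for comparison:

```cpp
#include <cmath>

// Multiply expressed as an fma with a +0.0 addend. When x * y is -0.0,
// (-0.0) + (+0.0) rounds to +0.0, so the sign of the product is lost.
inline float mulViaFmaPlusZero(float x, float y) {
  return std::fma(x, y, 0.0f);
}

// Multiply expressed as an fma with a -0.0 addend. -0.0 is the additive
// identity under round-to-nearest, so the product passes through unchanged.
inline float mulViaFmaMinusZero(float x, float y) {
  return std::fma(x, y, -0.0f);
}
```

For example, with `x = -1.0f` and `y = 0.0f` the first function returns +0.0 while the second returns -0.0, which is what a plain multiply produces.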
Iasonaskrpr
6c6fb00c94
[AMDGPU] Optimize S_OR_B32 to S_ADDK_I32 where possible (#177949)
This PR fixes #177753 by converting disjoint S_OR_B32 to S_ADDK_I32
whenever possible. It avoids this transformation when S_OR_B32 can be
converted to bitset.

Note on Test Failures (Draft Status): This change causes significant
register reshuffling across the test suite due to the new allocation
hints and the swaps performed in case src0 is not a register and src1,
along with the change from or to addk. To avoid a massive, noisy diff
during the initial logic review:

This Draft PR only includes a representative sample of updated tests.
CodeGen/AMDGPU/combine-reg-or-const.ll -> Showcases change from S_OR to
S_ADDK
CodeGen/AMDGPU/s-barrier.ll -> Showcases swap between Src0 and Src1 if
src0 is not a register

The rest of the tests show the result of the register allocation hint we
give. I have checked every test I updated and they seem ok to me.

Once the core logic is approved, I will run the update script across the
remaining ~70 failing tests and mark the PR as "Ready for Review."
2026-02-07 09:10:12 +00:00
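The rewrite in commit 6c6fb00c94 rests on a simple identity: when two operands share no set bits, OR and ADD give the same result because no carries occur. A small C++ sketch of that identity follows (the helper name `isDisjoint` is illustrative and not taken from the patch):

```cpp
#include <cstdint>

// True when no bit is set in both operands, i.e. the OR is "disjoint".
constexpr bool isDisjoint(uint32_t a, uint32_t b) {
  return (a & b) == 0;
}

// Disjoint operands: no carries can occur, so OR and ADD agree.
static_assert(isDisjoint(0xFF00u, 0x00FFu));
static_assert((0xFF00u | 0x00FFu) == (0xFF00u + 0x00FFu));

// Overlapping operands: the addition carries, so the results differ.
static_assert(!isDisjoint(0x3u, 0x1u));
static_assert((0x3u | 0x1u) != (0x3u + 0x1u));
```

This only illustrates the arithmetic; whether the add form is profitable or encodable for a given constant is decided by the patch itself.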
vangthao95
e67bfe85d9
[AMDGPU][GlobalISel] Fix D16 buffer load RegBankLegalize rules (#179982)
Use fast StandardB rule and add uniform rules and uniform tests.
2026-02-06 08:01:30 -08:00
Frederik Harwath
272d6dd445
[AMDGPU] Support v_lshl_add_u64 with non-constant shift amount (#179904)
This commit also adds GlobalISel testing to llvm/test/CodeGen/AMDGPU/lshl-add-u64.ll.
2026-02-06 16:03:58 +01:00
Jay Foad
4a6697f393
[AMDGPU] Fix and simplify patterns selecting fsub to v_fma_mix_f32 (#180169)
Select (fsub x, y) -> (fma y, -1.0, x). Using -1.0 as the constant
avoids the need for ComplexPatterns to negate x or y.

This also fixes the bad pattern (fsub x, y) -> (fma -x, 1.0, y).
2026-02-06 14:39:13 +00:00
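The two patterns named in commit 4a6697f393 can be compared directly. In this sketch `std::fma` stands in for the hardware instruction and the function names are illustrative only:

```cpp
#include <cmath>

// (fsub x, y) -> (fma y, -1.0, x): computes y * -1.0 + x, which is x - y.
inline float subViaFma(float x, float y) {
  return std::fma(y, -1.0f, x);
}

// The bad pattern (fsub x, y) -> (fma -x, 1.0, y): computes -x + y,
// which is y - x, the operands swapped.
inline float subViaBadFma(float x, float y) {
  return std::fma(-x, 1.0f, y);
}
```

With `x = 5.0f` and `y = 3.0f`, the first form yields 2.0 and the second yields -2.0.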
Mirko Brkušanin
20b5849e17
[AMDGPU] Define new target gfx1170 (#180185) 2026-02-06 14:38:50 +01:00
Nikita Popov
0287d789e0
[ExpandIRInsts] Freeze input in itofp expansion (#180157)
We are introducing branches on the value, and branch on undef/poison is
UB, so the value needs to be frozen.
2026-02-06 12:52:31 +01:00
Pierre van Houtryve
6824db46c6
[AMDGPU] Set MOThreadPrivate on memory accesses for spills (#179414)
Mark the memory operand of spill load/stores as MOThreadPrivate, so that
these loads and stores are emitted with `nv` set.

The reason is that scratch memory used by spills will never be shared by
another thread. It's purely thread local and thus a good fit for the
`nv` bit, which is controlled by the MOThreadPrivate flag.
2026-02-06 11:14:14 +00:00
Pierre van Houtryve
b738491d2f
[AMDGPU][GFX12.5] Add support for emitting memory operations with nv bit set (#179413)
- Add `MONonVolatile` MachineMemOperand flag.
- Set nv=1 on memory operations on GFX12.5 if the operation accesses a
constant address space,
  is an invariant load, or has the `MONonVolatile` flag set.
2026-02-06 11:35:46 +01:00
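The condition listed in commit b738491d2f can be read as a three-way disjunction. A minimal sketch, using a hypothetical `MemAccess` summary struct rather than the real `MachineMemOperand` API:

```cpp
// Hypothetical summary of a memory operation; the field names are
// illustrative and do not mirror the LLVM data structures.
struct MemAccess {
  bool inConstantAddressSpace;
  bool isInvariantLoad;
  bool hasNonVolatileFlag;  // models the MONonVolatile flag
};

// nv=1 if any of the three conditions from the commit message holds.
constexpr bool shouldSetNV(const MemAccess &m) {
  return m.inConstantAddressSpace || m.isInvariantLoad ||
         m.hasNonVolatileFlag;
}

static_assert(shouldSetNV({true, false, false}));
static_assert(shouldSetNV({false, true, false}));
static_assert(shouldSetNV({false, false, true}));
static_assert(!shouldSetNV({false, false, false}));
```

The subtarget check (GFX12.5 only) is left out of the sketch.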
Abhinav Garg
8cc06421a5
Adding support for G_STRICT_FMA in new reg bank select (#170330)
This patch adds legalization rules for G_STRICT_FMA opcode.

---------

Co-authored-by: Abhinav Garg <abhigarg@amd.com>
2026-02-06 15:43:37 +05:30
Petar Avramovic
9d11a6670e
AMDGPU/GlobalISel: Regbanklegalize rules for G_FREEZE (#179796)
Move G_FREEZE handling to AMDGPURegBankLegalizeRules.cpp.
Added support for uniform S1.
2026-02-06 11:05:47 +01:00
Steffen Larsen
5654ecd5dd
[DAGCombiner] Fix exact power-of-two signed division for large integers (#177340)
Previously, the DAG combiner did not optimize exact signed division by a
power-of-two constant divisor for integer types exceeding the size of
division supported by the target architecture (e.g., i128 on x86-64).
However, such an optimization was expected by the division expansion
logic, leading to unsupported division operations making it to
instruction selection.
This commit addresses this issue by making an exception to the existing
exclusion of signed division with the exact flag for the aforementioned
operations. That is, the DAG combiner will now optimize exact signed
division if the divisor is a power-of-two constant and the integer type
exceeds the size of division supported by the target architecture.

---------

Signed-off-by: Steffen Holst Larsen <HolstLarsen.Steffen@amd.com>
2026-02-06 09:40:32 +01:00
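The optimization restored by commit 5654ecd5dd depends on exactness: when the dividend is a multiple of the power-of-two divisor, signed division equals an arithmetic shift right. A short sketch on `int64_t` (the commit concerns wider types such as i128; the function name is illustrative, and C++20 semantics for `>>` on signed values are assumed):

```cpp
#include <cstdint>

// Exact signed division by 2^k as an arithmetic shift right.
// Only valid when x is a multiple of 2^k, which is what the `exact`
// flag promises. Relies on C++20, where >> on signed integers is
// defined as an arithmetic shift.
constexpr int64_t exactSDivPow2(int64_t x, unsigned k) {
  return x >> k;
}

static_assert(exactSDivPow2(-64, 3) == -64 / 8);
static_assert(exactSDivPow2(96, 5) == 96 / 32);

// Without exactness the two differ: the shift rounds toward negative
// infinity while division truncates toward zero.
static_assert(exactSDivPow2(-7, 1) == -4 && -7 / 2 == -3);
```

This is why the shift is a valid replacement only when the `exact` flag is present.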
vangthao95
1ef499b1fc
AMDGPU/GlobalISel: Fix buffer store RegBankLegalize rules (#179994)
Enable commented out D16 v3f16 tests.
2026-02-05 16:20:09 -08:00
vangthao95
376dc83d7a
[AMDGPU][GlobalISel] Add RegBankLegalize rules for TFE buffer loads (#179529) 2026-02-05 13:42:11 -08:00
Matt Arsenault
82799a448e
Reapply "AMDGPU: Use real copysign in fast pow (#97152)" (#178036)
This reverts commit bff619f91015a633df659d7f60f842d5c49351df.

This was reverted due to regressions caused by poor copysign
optimization, which have been fixed.
2026-02-05 20:40:38 +00:00
Alexander Weinrauch
3b16468814
[AMDGPU] Global and Buffer loads to LDS should not increase lgkmcnt (#179305)
`global_load_lds` and `buffer_load to lds` only increment `vmcnt` and
do not touch `lgkmcnt`. This causes invalid `waitcnts` for some Triton
kernels, similar to the added lit tests.

Note that the change for buffer ops is not necessary, i.e. the lit test
passes even before this PR, because it seems like `SIInsertWaitcnts`
does not use `LGKM_CNT` for buffer ops. But this change might prevent a
bug in the future.
2026-02-05 09:36:00 -08:00
anjenner
903a5ab93d
[AMDGPU] [GlobalISel] Add register bank legalize rules for G_FEXP2 (#179954)
Also G_INTRINSIC_TRUNC, G_INTRINSIC_ROUNDEVEN, G_FFLOOR, G_FCEIL, and
G_FLOG2.
2026-02-05 16:35:31 +00:00
Vigneshwar Jayakumar
2dcd75eb44
[AMDGPU] Fix missing waitcnt after buffer_wbl2 (#178316)
On GFX9, BUFFER_WBL2 is used to write back dirty cache lines and
requires an s_waitcnt vmcnt(0) afterwards to ensure completion.

This patch fixes this by incrementing vmcnt for the buffer_wbl2 instruction.

---------

Co-authored-by: Jay Foad <jay.foad@gmail.com>
2026-02-05 10:13:51 -06:00
Nikita Popov
d3fb3c5d36
[GISel][CallLowering] Keep IR types longer (#179946)
GISel CallLowering currently does a Type -> EVT -> Type roundtrip early
on when populating ArgInfo in splitToValueType(). This is a bit odd as
this structure operates at the IR Type level. Keep the original type
there and only convert to EVT when performing assignments.
2026-02-05 16:37:08 +01:00
vangthao95
e0c2cc7ed0
[AMDGPU][GlobalISel] Add buffer store byte/short RegBankLegalize rules (#179367) 2026-02-05 07:18:39 -08:00
Matt Arsenault
2502e3b7ba
IR: Promote "denormal-fp-math" to a first class attribute (#174293)
Convert "denormal-fp-math" and "denormal-fp-math-f32" into a first
class denormal_fpenv attribute. Previously the query for the effective
denormal mode involved two string attribute queries with parsing. I'm
introducing more uses of this, so it makes sense to convert this
to a more efficient encoding. The old representation was also awkward
since it was split across two separate attributes. The new encoding
just stores the default and float modes as bitfields, largely avoiding
the need to consider if the other mode is set.

The syntax in the common cases looks like this:
  `denormal_fpenv(preservesign,preservesign)`
  `denormal_fpenv(float: preservesign,preservesign)`
  `denormal_fpenv(dynamic,dynamic float: preservesign,preservesign)`

I wasn't sure about reusing the float type name instead of adding a
new keyword. It's parsed as a type but only accepts float. I'm also
debating switching the name to subnormal to match the current
preferred IEEE terminology (also used by nofpclass and other
contexts).

This has a behavior change when using the command flag debug
options to set the denormal mode. The behavior of the flag
ignored functions with an explicit attribute set, per
the default and f32 version. Now that these are one attribute,
the flag logic can't distinguish which of the two components
were explicitly set on the function. Only one test appeared to
rely on this behavior, so I just avoided using the flags in it.

This also does not perform all the code cleanups this enables.
In particular the attributor handling could be cleaned up.

I also guessed at how to support this in MLIR. I followed
MemoryEffects as a reference; it appears bitfields are expanded
into arguments to attributes, so the representation there is
a bit uglier with the 2 2-element fields flattened into 4 arguments.
2026-02-05 13:31:26 +00:00
Acim Maravic
b0827f3b36
[LLVM] Select fma_mix for v_cvt_f32_f16 and v_add_f32/v_mul_f32 (#160151) 2026-02-05 11:51:25 +01:00
Matt Arsenault
8461579298
AMDGPU: Add nofpclass when expanding pow (#177933)
The codegen regression is tracked in #177913
2026-02-05 07:40:21 +01:00
Nicolai Hähnle
3e1e86ef1f
[AMDGPU] Return two MMOs for load-to-lds and store-from-lds intrinsics (#175845)
Accurately represent both the load and the store part of those intrinsics.

The test changes seem to be mostly fairly insignificant changes caused
by subtly different scheduler behavior.
2026-02-04 12:29:49 -08:00
Alex Wang
b33a0e6101
[SelectionDAG] Add expansion for llvm.modf intrinsic (#179434)
Targets without a `modf` libcall lower the intrinsic directly, matching
the existing `llvm.frexp` expansion. Targets with an existing libcall
are unchanged.

Fixes #173021
2026-02-04 21:25:47 +01:00
vangthao95
273ee97738
[AMDGPU][GlobalISel] Add G_SADDE/SSUBE RegBankLegalize rule (#179603) 2026-02-04 09:41:27 -08:00
vangthao95
b0aea0539f
[AMDGPU][GlobalISel] Add buffer load format D16 RegBankLegalize rules (#179566) 2026-02-04 09:20:41 -08:00
Brox Chen
2e58f6024a
[AMDGPU][True16] t16 pseudo for mubuffer d16 load/store (#178822)
Create t16 pseudos for mubuffer d16 load/store with vgpr16 in vdst/vdata
and use these t16 pseudos for isel patterns. Lower them back to d16
machine instructions at the MC level.
2026-02-04 10:54:11 -05:00
Carl Ritson
be9ba44256
[AMDGPU] Add machineFunctionInfo to recent MIR tests (#179602)
Initialize machineFunctionInfo in recently added MIR tests to assist in
downstream testing.
2026-02-04 22:12:01 +09:00