8758 Commits

Author SHA1 Message Date
Changpeng Fang
70e78be7dc
AMDGPU: Custom lower fptrunc vectors for f32 -> f16 (#141883)
The latest ASICs support the v_cvt_pk_f16_f32 instruction. However, the current
implementation of vector fptrunc lowering fully scalarizes the vectors,
and the scalar conversions may not always be combined to generate the
packed one.
We made v2f32 -> v2f16 legal in
https://github.com/llvm/llvm-project/pull/139956. This work is an
extension to handle wider vectors. Instead of full scalarization, we
split the vector into packs (v2f32 -> v2f16) to ensure the packed
conversion can always be generated.
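
The splitting strategy can be modelled outside the compiler. The Python sketch below is an illustration only, not the LLVM lowering code: `f16_bits`, `cvt_pk_f16_f32` and `fptrunc_by_packs` are hypothetical names, the `e` format of `struct` stands in for the hardware conversion, and the handling of an odd trailing element is an assumption.

```python
import struct


def f16_bits(x: float) -> int:
    """Return the IEEE half-precision bit pattern of x."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def cvt_pk_f16_f32(lo: float, hi: float) -> int:
    """Model one packed conversion: two floats in, one 32-bit word out."""
    return (f16_bits(hi) << 16) | f16_bits(lo)


def fptrunc_by_packs(vec):
    """Convert a wide vector pair by pair instead of element by element."""
    words = []
    for i in range(0, len(vec) - 1, 2):
        words.append(cvt_pk_f16_f32(vec[i], vec[i + 1]))
    if len(vec) % 2:
        # Assumption: an odd trailing element is converted on its own.
        words.append(f16_bits(vec[-1]))
    return words
```

A four-element input produces two words, one per packed conversion. Note that `struct` raises `OverflowError` for values outside the half-precision range, which the real instruction would not do.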
2025-06-06 15:15:24 -07:00
LU-JOHN
549ce80f27
[AMDGPU][NFC] Add test for 64-bit lshr with shifts >=32 (#138281)
Record current results for 64-bit lshr with shifts >=32.

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2025-06-06 17:04:54 -04:00
LU-JOHN
3bbb49610e
[AMDGPU][NFC] Add tests for 64-bit ashr with shifts >= 32 (#142463)
Record current results for 64-bit ashr with shifts >=32.

Signed-off-by: John Lu <John.Lu@amd.com>
2025-06-06 17:04:45 -04:00
Stanislav Mekhanoshin
19e2fd5e75
[AMDGPU] Patterns for <2 x bfloat> fneg (fabs) (#142911) 2025-06-05 10:00:24 -07:00
Stanley Gambarin
33974b41c7
[GlobalISel] support lowering of G_SHUFFLEVECTOR with pointer args (#141959) 2025-06-05 09:13:51 -07:00
Brox Chen
d8b245741d
[AMDGPUI][True16][CodeGen] global atomic load i8 in true16 mode (#142822)
Update codegen pattern for global atomic load i8 with d16 instructions
2025-06-05 11:35:18 -04:00
Nick Sarnie
3b9ebe9201
[clang] Simplify device kernel attributes (#137882)
We have multiple different attributes in clang representing device
kernels for specific targets/languages. Refactor them into one attribute
with different spellings to make it more easily scalable for new
languages/targets.

---------

Signed-off-by: Sarnie, Nick <nick.sarnie@intel.com>
2025-06-05 14:15:38 +00:00
Harrison Hao
b2379bd5d5
[AMDGPU] Support bottom-up postRA scheduing. (#135295)
Solely relying on top‑down scheduling can underutilize hardware, since
long‑latency instructions often end up scheduled too late and their
latency isn’t well hidden. Adding bottom‑up post‑RA scheduling lets us
move those instructions earlier, which improves latency hiding and
yields roughly a 2% performance gain on key benchmarks.
2025-06-05 22:07:06 +08:00
Stanislav Mekhanoshin
9b992f29e0
[AMDGPU] Baseline fneg-fabs.bf16.ll tests. NFC. (#142910) 2025-06-05 02:23:23 -07:00
Stanislav Mekhanoshin
0c1c60fa63
[AMDGPU] Make <2 x bfloat> fabs legal (#142908) 2025-06-05 02:22:09 -07:00
Stanislav Mekhanoshin
100a1d0c4c
[AMDGPU] Baseline fabs.bf16.ll tests. NFC. (#142907) 2025-06-05 00:57:10 -07:00
Ruiling, Song
0487db1f13
MachineScheduler: Improve instruction clustering (#137784)
The existing way of managing clustered nodes was to add weak edges
between neighbouring cluster nodes, forming a sort of ordered queue.
This is later recorded as `NextClusterPred` or `NextClusterSucc` in
`ScheduleDAGMI`.

In practice, however, instructions may not be picked in the exact order
of the queue. For example, given a queue of cluster nodes A B C, node B
might be picked first during scheduling; it is then very likely that only
B and C are clustered for Top-Down scheduling (leaving A alone).

Another issue is:
```
   if (!ReorderWhileClustering && SUa->NodeNum > SUb->NodeNum)
      std::swap(SUa, SUb);
   if (!DAG->addEdge(SUb, SDep(SUa, SDep::Cluster)))
```
may break the cluster queue.

For example, suppose we want to cluster nodes in the order 1 3 2 (as in
`MemOpRecords`). Normally 1(SUa) becomes a pred of 3(SUb). But for the
pair (3, 2), since 3(SUa) > 2(SUb), the two nodes are reordered, which
makes 2 a pred of 3. Both 1 and 2 thus become preds of 3, but there is no
edge between 1 and 2, so the cluster chain is broken.
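
The broken chain can be reproduced with a small model. This Python sketch is not the scheduler code: `add_cluster_edges` and `is_single_chain` are hypothetical helpers, nodes are plain integers standing in for `NodeNum`, and pairing consecutive entries of the list is an assumption about how the pairs (1, 3) and (3, 2) arise.

```python
def add_cluster_edges(mem_op_order, reorder_while_clustering=False):
    """Return (pred, succ) cluster edges for consecutive pairs."""
    edges = []
    for su_a, su_b in zip(mem_op_order, mem_op_order[1:]):
        # Mirrors the quoted snippet: the lower node number becomes the pred.
        if not reorder_while_clustering and su_a > su_b:
            su_a, su_b = su_b, su_a
        edges.append((su_a, su_b))
    return edges


def is_single_chain(edges):
    """True if no node has more than one cluster pred or succ."""
    succs = [succ for _, succ in edges]
    preds = [pred for pred, _ in edges]
    return len(set(succs)) == len(succs) and len(set(preds)) == len(preds)
```

With the order 1 3 2 the model yields the edges (1, 3) and (2, 3), so node 3 has two preds and `is_single_chain` reports a broken chain.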

To fix both issues, this change introduces an unordered set. This
could help improve clustering in some hard cases.

One key reason the change causes so many test check changes is that the
cluster candidates are no longer ordered, so they might be picked in a
different order than before.

The most affected targets are: AMDGPU, AArch64, RISCV.

For RISCV, most changes appear to be minor instruction reordering, with
no obvious regressions.

For AArch64, some combining of ldr into ldp is affected, with two cases
regressed and two improved. The deeper reason is that the machine
scheduler cannot cluster them well either before or after the change,
and the later load-combine algorithm is also not smart enough.

For AMDGPU, some cases use more v_dual instructions while others
regress. This seems less critical. Test `v_vselect_v32bf16` appears to
get more buffer_load instructions claused.
2025-06-05 15:28:04 +08:00
Stanislav Mekhanoshin
a56442529c
[AMDGPU] Make <2 x bfloat> fneg legal (#142870) 2025-06-04 22:09:25 -07:00
Shilei Tian
8cd5604f59
[AMDGPU][AtomicExpand] Use full flat emulation if a target supports f64 global atomic add instruction (#142859)
If a target supports f64 global atomic add instruction, we can also use
full flat emulation.
2025-06-05 00:45:42 -04:00
Stanislav Mekhanoshin
9d41159023
[AMDGPU] Add baseline fneg.bf16.ll tests. NFC. (#142866)
This is a copy of the fneg.f16.ll, just with type replaced.
The final logic shall be the same as with f16 as these are
just bit operations.

---------

Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
2025-06-04 20:54:03 -07:00
Harrison Hao
8ca220f1dd
[NFC][AMDGPU] Add lit tests for FMA combining with freeze and nnan variants (#142628)
Combining a `freeze` of an `fmul` (without `nnan`) followed by an `fadd`
or `fsub` into a single `fma` is supported.
This patch adds lit tests to verify the optimization behavior for both
nnan and non-nnan variants.
2025-06-05 09:55:44 +08:00
Brox Chen
d2f06b2729
[AMDGPU][True16][MC][CodeGen] true16 mode for v_cvt_pk_bf8/fp8_f32 (#141881)
Update true16/fake16 profile with v_cvt_pk_bf8/fp8_f32, keeping the
vdst_in profile, and update codegen pattern.

Update MC test and codegen test.
2025-06-04 11:29:26 -04:00
Brox Chen
b668b6439a
[AMDGPU][True16][CodeGen] legalize 16bit and 32bit use-def chain for moveToVALU in si-fix-sgpr-lowering (#138734)
Two changes in this patch:
1. Cover another case in the legalizeOperandVALUt16 functions and the COPY
lowering: when a SALU16 is used by a SALU32, a reg_sequence needs to be
inserted after the move to VALU (previously only the case of a SALU32
used by a SALU16 was considered).
2. Move the useMI analysis into addUsersToMoveVALUList, and legalize the
targeted operand when needed.

Turn on the frem test with true16 mode for gfx1150, which was failing before
this patch. A few bitcast tests are also impacted by this change, with some
v_mov being replaced by dual mov.
2025-06-04 09:53:10 -04:00
Carl Ritson
e47e4d8ae6
[AMDGPU] SIInsertHardClause: add configurable clause length limit (#142343)
Add command line and function attribute configuration of hard clause
length limit (within hardware maximum).
This allows performance tuning for shaders which benefit from smaller
clauses.
2025-06-04 15:00:34 +09:00
Vigneshwar Jayakumar
b3a8c1ef3a
[AMDGPU] Bugfix for scaled MFMA parsing FP literals (#142493)
Fix parsing of FP literals for scale values in the scaled MFMA.

Due to the change in operand order between MCInst and the parsed
operands, the FP literal imms for scale values were not parsed
correctly.

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-06-03 19:27:57 -05:00
YunQiang Su
bd831372b2
expandFMINIMUMNUM_FMAXIMUMNUM: Quiet is not needed for NaN vs NaN (#139237)
The new LangRef doesn't require quieting for NaN vs NaN, i.e. the result
may be sNaN for sNaN vs NaN.
See: https://github.com/llvm/llvm-project/pull/139228
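
As a rough illustration of the semantics being discussed, here is a Python model of a minimumnum-style operation. It is a sketch under assumptions: `minimumnum` is a hypothetical helper, the rules encoded are my reading of the description (a number wins over a NaN, -0.0 orders below +0.0), and Python floats cannot reliably tell sNaN from qNaN, so the model only shows that no quieting step is applied when both inputs are NaN.

```python
import math


def minimumnum(a: float, b: float) -> float:
    """Toy model: prefer numbers over NaNs, order -0.0 below +0.0."""
    if math.isnan(a) and math.isnan(b):
        # NaN vs NaN: a NaN comes back as-is; no quieting is modelled.
        return a
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    if a == b == 0.0:
        # Both are zeros; pick the negative one if there is one.
        return a if math.copysign(1.0, a) < 0 else b
    return a if a < b else b
```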
2025-06-04 08:20:48 +08:00
Harrison Hao
0107c9333c
[DAG] canCreateUndefOrPoison – mark fneg/fadd/fsub/fmul/fdiv/frem as not poison generating (#142345)
After revisiting the LLVM Language Reference Manual, it is confirmed that
plain floating-point operations (`fneg`, `fadd`, `fsub`, `fmul`, `fdiv`,
and `frem`) propagate poison but do not inherently create new poison
values. Thus, `SelectionDAG::canCreateUndefOrPoison` should return
`false` for these operations by default.

Poison generation in FP instructions occurs only when specific fast-math
flags (`nnan`, `ninf`, or the collective `fast`) are present, as these
flags explicitly convert NaN or Inf results into poison.
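
The difference between propagating and creating poison can be shown with a toy model. The Python sketch below is not SelectionDAG code: `POISON` is a sentinel object and `fadd` is a hypothetical stand-in for the IR instruction, written on the assumption that `nnan` and `ninf` turn NaN or Inf arguments and results into poison.

```python
import math

POISON = object()  # Sentinel standing in for an IR poison value.


def fadd(a, b, nnan=False, ninf=False):
    """Toy model of `fadd` with optional fast-math flags."""
    if a is POISON or b is POISON:
        return POISON  # Propagation: poison in, poison out.
    result = a + b
    if nnan and any(math.isnan(v) for v in (a, b, result)):
        return POISON  # Creation: only possible because of the flag.
    if ninf and any(math.isinf(v) for v in (a, b, result)):
        return POISON
    return result
```

Without flags, `fadd(inf, -inf)` returns an ordinary NaN; the same call with `nnan=True` returns `POISON`.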

References:

- [`fneg` instruction
documentation](https://llvm.org/docs/LangRef.html#fneg-instruction)
- [`fadd` instruction
documentation](https://llvm.org/docs/LangRef.html#fadd-instruction)
- [`fsub` instruction
documentation](https://llvm.org/docs/LangRef.html#fsub-instruction)
- [`fmul` instruction
documentation](https://llvm.org/docs/LangRef.html#fmul-instruction)
- [`fdiv` instruction
documentation](https://llvm.org/docs/LangRef.html#fdiv-instruction)
- [`frem` instruction
documentation](https://llvm.org/docs/LangRef.html#frem-instruction)
- [Fast-Math Flags
documentation](https://llvm.org/docs/LangRef.html#fast-math-flags)
2025-06-03 19:21:40 +08:00
Diana Picus
130080fab1
[AMDGPU] Skip register uses in AMDGPUResourceUsageAnalysis (#133242)
Don't count register uses when determining the maximum number of
registers used by a function. Count only the defs. This is really an
underestimate of the true register usage, but in practice that's not
a problem because if a function uses a register, then it has either
defined it earlier, or some other function that executed before has
defined it.

In particular, the register counts are used:
1. When launching an entry function - in which case we're safe because
   the register counts of the entry function will include the register
   counts of all callees.
2. At function boundaries in dynamic VGPR mode. In this case it's safe
   because whenever we set the new VGPR allocation we take into account
   the outgoing_vgpr_count set by the middle-end.

The main advantage of doing this is that the artificial VGPR arguments
used only for preserving the inactive lanes when using the
llvm.amdgcn.init.whole.wave intrinsic are no longer counted. This
enables us to allocate only the registers we need in dynamic VGPR mode.
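
To make the effect concrete, the following Python sketch counts registers with and without uses. It is a simplified model, not the analysis itself: `max_registers_used` is a hypothetical helper, and instructions are reduced to `(defs, uses)` pairs of register indices.

```python
def max_registers_used(instructions, count_uses=False):
    """Return the highest register index touched, plus one.

    Each instruction is a (defs, uses) pair of register-index lists.
    """
    highest = -1
    for defs, uses in instructions:
        regs = list(defs) + (list(uses) if count_uses else [])
        for reg in regs:
            highest = max(highest, reg)
    return highest + 1


# A function that defines v0..v3 but also reads v40, for instance an
# artificial argument that only preserves inactive lanes.
example = [([0], [40]), ([1, 2], [0]), ([3], [1, 2])]
```

Counting uses as well gives 41 registers for `example`, while counting only defs gives 4.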

---------

Co-authored-by: Thomas Symalla <5754458+tsymalla@users.noreply.github.com>
2025-06-03 11:20:48 +02:00
Yingwei Zheng
1984c7539e
[ValueTracking] Do not use FMF from fcmp (#142266)
This patch introduces an FMF parameter for
`matchDecomposedSelectPattern` to pass FMF flags from select, instead of
fcmp.

Closes https://github.com/llvm/llvm-project/issues/137998.
Closes https://github.com/llvm/llvm-project/issues/141017.
2025-06-02 18:21:14 +08:00
Harrison Hao
1a7f5f5833
[AMDGPU] Promote nestedGEP allocas to vectors (#141199)
Supports the `nestedGEP` pattern that
appears when an alloca is first indexed as an array element and then
shifted with a byte-offset GEP:

```llvm
  %SortedFragments = alloca [10 x <2 x i32>], addrspace(5), align 8
  %row  = getelementptr [10 x <2 x i32>], ptr addrspace(5) %SortedFragments, i32 0, i32 %j
  %elt1 = getelementptr i8, ptr addrspace(5) %row, i32 4
  %val  = load i32, ptr addrspace(5) %elt1
```

The pass folds the two levels of addressing into a single vector lane
index and keeps the whole object in a VGPR:

```llvm
  %vec  = freeze <20 x i32> poison              ; alloca promote  <20 x i32>
  %idx0 = mul i32 %j, 2                         ; j * 2
  %idx  = add i32 %idx0, 1                      ; j * 2 + 1
  %val  = extractelement <20 x i32> %vec, i32 %idx
```

This eliminates the scratch read.
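
The index arithmetic in the example can be checked with a short Python sketch. `folded_lane_index` is a hypothetical helper, the defaults match the `[10 x <2 x i32>]` case above, and rejecting offsets that do not fall on a lane boundary is an assumption rather than documented behaviour of the pass.

```python
def folded_lane_index(j, byte_offset, lanes_per_row=2, lane_bytes=4):
    """Fold an array index and a byte offset into one vector lane index."""
    if byte_offset % lane_bytes != 0:
        # Assumption: only lane-aligned byte offsets can be folded.
        raise ValueError("byte offset does not land on a lane boundary")
    return j * lanes_per_row + byte_offset // lane_bytes
```

For the IR above, a byte offset of 4 gives `j * 2 + 1`, matching the `mul` and `add` in the promoted code.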
2025-06-02 16:20:14 +08:00
Matt Arsenault
ad0a52202e
AMDGPU: Improve v32f16/v32bf16 copysign handling (#142177) 2025-05-31 08:24:51 +02:00
Matt Arsenault
3aeffcfde1
AMDGPU: Improve v16f16/v16bf16 copysign handling (#142176) 2025-05-31 08:18:52 +02:00
Matt Arsenault
ffee01e748
AMDGPU: Improve v8f16/v8bf16 copysign handling (#142175) 2025-05-31 08:15:45 +02:00
Matt Arsenault
20ad4209dd
AMDGPU: Improve v4f16/v4bf16 copysign handling (#142174) 2025-05-31 08:09:51 +02:00
Matt Arsenault
4aa4005e04
AMDGPU: Make copysign with matching v2f16/v2bf16 inputs legal (#142173)
Fixes #141931
2025-05-31 08:06:49 +02:00
Shilei Tian
4d48673562 Reapply "Reapply "[AMDGPU] Make getAssumedAddrSpace return AS1 for pointer kernel arguments (#137488)""
This reverts commit 37ea3b32cdcb6c0dcecbcc4bf844f5190c7378dd.
2025-05-30 22:11:22 -04:00
Shilei Tian
37ea3b32cd Revert "Reapply "[AMDGPU] Make getAssumedAddrSpace return AS1 for pointer kernel arguments (#137488)""
This reverts commit 4efc13f8ff1eaf4f9fb1fcea8d4552b3eca052ca.
2025-05-30 22:06:16 -04:00
Shilei Tian
4efc13f8ff Reapply "[AMDGPU] Make getAssumedAddrSpace return AS1 for pointer kernel arguments (#137488)"
This reverts commit 3c6211c183885afb5d89259a53c4f4f46a6bf399.
2025-05-30 21:56:24 -04:00
Shilei Tian
3c6211c183 Revert "[AMDGPU] Make getAssumedAddrSpace return AS1 for pointer kernel arguments (#137488)"
This reverts commit 9bf6b2a8cb0467b62173659306e43a0346f063a2.
2025-05-30 21:15:25 -04:00
Shilei Tian
9bf6b2a8cb
[AMDGPU] Make getAssumedAddrSpace return AS1 for pointer kernel arguments (#137488) 2025-05-30 17:30:42 -04:00
Matt Arsenault
6a6aec6f4e
AMDGPU: Handle vectors in copysign sign type combine (#142157)
This avoids some ugly codegen on targets that predate 16-bit instructions,
caused by annoying f16 legalization effects. This also avoids regressions
on newer targets in a future patch.
2025-05-30 20:02:07 +02:00
Matt Arsenault
e39e99022a
AMDGPU: Handle vectors in copysign magnitude sign case (#142156) 2025-05-30 19:58:55 +02:00
Matt Arsenault
ba4f4a1a18
AMDGPU: Add more f16 copysign tests (#142115) 2025-05-30 19:56:15 +02:00
Matt Arsenault
c9cca5cdc4
AMDGPU: Move bf16 copysign tests to separate file (#142114)
Make symmetric with other copysign tests
2025-05-30 19:52:56 +02:00
Matt Arsenault
d11f9d45e4
AMDGPU: Avoid using kernels in f16 copysign test (#142113)
Avoid the memory noise in tests that predate function support.
2025-05-30 19:49:45 +02:00
Jay Foad
f8d3bdf6a2
[AMDGPU] Fix SIFixSGPRCopies handling of STRICT_WWM and friends (#142122)
SIFixSGPRCopies handled STRICT_WWM (and similar WWM/WQM pseudos) like a
COPY. In particular, if the source was a VGPR and the result was an
SGPR, lowerVGPR2SGPRCopies would replace it with a readfirstlane,
erasing the original pseudo and hence sabotaging the WWM region marking
which is supposed to be performed by SIWholeQuadMode.

Fix this by handling it more like INSERT_SUBREG, PHI and REG_SEQUENCE:
if the source is a VGPR then move the result to a VGPR, and keep the
pseudo.
2025-05-30 16:32:56 +01:00
Daniil Fukalov
5208f722d8
[AMDGPU] Fix SIFoldOperandsImpl::canUseImmWithOpSel() for VOP3 packed [B]F16 imms. (#142142)
VOP3 instructions ignore opsel source modifiers, so a constant that
contains two different [B]F16 imms cannot be encoded into an instruction
with a src opsel.

E.g. without the fix, the following instructions

`s_mov_b32 s0, 0x40003c00 // <half 1.0, half 2.0>`
`v_cvt_scalef32_pk_fp8_f16 v0, s0, v2`

lose the `2.0` imm and are folded into

`v_cvt_scalef32_pk_fp8_f16 v1, 1.0, 1.0`
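
The property at stake can be sketched in Python. This captures only the stated condition (two different 16-bit halves cannot be represented by a single immediate); `halves` and `foldable_without_opsel` are hypothetical names, and the real check in `canUseImmWithOpSel` is assumed to be more involved.

```python
def halves(imm32):
    """Split a 32-bit constant into its (low, high) 16-bit halves."""
    return imm32 & 0xFFFF, (imm32 >> 16) & 0xFFFF


def foldable_without_opsel(imm32):
    """True only if both packed halves carry the same 16-bit value."""
    lo, hi = halves(imm32)
    return lo == hi
```

For `0x40003c00` the halves are `0x3c00` (half 1.0) and `0x4000` (half 2.0), so the model rejects the fold.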

Fixes SWDEV-531672
2025-05-30 16:38:07 +02:00
LU-JOHN
f88a9a32d9
[AMDGPU] Extend SRA i64 simplification for shift amts in range [33:62] (#138913)
Extend sra i64 simplification to shift constants in range [33:62]. Shift
amounts 32 and 63 were already handled.
New testing for shift amts 33 and 62 added in sra.ll. Changes to other
test files were to adapt previous test results to this extension.
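
For shift amounts of 32 or more, an arithmetic right shift of a 64-bit value depends only on its high 32 bits. The Python sketch below checks that identity; it is not the DAG combine itself, and `ashr64_split` is a hypothetical name for one plausible decomposition into two 32-bit shifts.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed(value, bits):
    """Interpret an unsigned bit pattern as a two's-complement integer."""
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def ashr64_reference(x, n):
    """Plain 64-bit arithmetic shift right on a bit pattern."""
    return (to_signed(x, 64) >> n) & MASK64


def ashr64_split(x, n):
    """Same result for 32 <= n <= 63, using only 32-bit shifts of the high half."""
    if not 32 <= n <= 63:
        raise ValueError("shift amount must be in [32, 63]")
    hi = to_signed((x >> 32) & MASK32, 32)
    lo_result = (hi >> (n - 32)) & MASK32  # Low word: high half shifted.
    hi_result = (hi >> 31) & MASK32        # High word: sign bits only.
    return (hi_result << 32) | lo_result
```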

---------

Signed-off-by: John Lu <John.Lu@amd.com>
2025-05-30 16:21:38 +02:00
Matt Arsenault
f70e920c87
AMDGPU: Directly check if shrink-instructions run is post-RA (#142009) 2025-05-30 15:52:18 +02:00
Matt Arsenault
a227b26d35
AMDGPU: Fix broken XFAILed test for fat pointer null initializers (#142015)
This was failing on the buffer fat pointer lowering error in the
addrspace(7) case, not the expected asm printer breakage. Also remove
the attempt at FileChecking the result, since that is dependent on the
actual fix and we want the unexpected pass whenever the assert is fixed.
2025-05-30 07:55:46 +02:00
Matt Arsenault
6b81483e28
AMDGPU: Start using LLVMContext errors in buffer fat pointer lowering (#142014)
Avoid using report_fatal_error. Many more uses that should be converted
in the pass remain.
2025-05-30 07:52:45 +02:00
Shilei Tian
84a69a0f8f
[AMDGPU] Move InferAddressSpacesPass to middle end optimization pipeline (#138604)
It will run twice in the non-LTO pipeline with `O1` or higher. In LTO post link pipeline, it will be run once with `O2` or higher, since inline and SROA don't run in `O1`.
2025-05-29 17:20:56 -04:00
Matt Arsenault
cc8d253f39
AMDGPU: Handle other fmin flavors in fract combine (#141987)
Since the input is either known not-nan, or we have explicit use
code checking if the input is a nan, any of the 3 is valid to match.
2025-05-29 22:11:01 +02:00
Matt Arsenault
c569248b74
AMDGPU: Add baseline tests for fract combine with other fmin types (#141986) 2025-05-29 22:05:00 +02:00
Matt Arsenault
3c5c0709e5
AMDGPU: Add missing fract test (#141985)
This was missing the case where the fcmp condition and select were
inverted.
2025-05-29 21:57:58 +02:00