7854 Commits

Author SHA1 Message Date
Shilei Tian
a74659445d
[AMDGPU] Skip terminators when forcing emit zero flag (#112116)
When forcing emit zero, we need to skip terminators of a MBB; otherwise
the terminator list of the MBB would be broken.
2024-10-14 11:46:18 -04:00
Akshat Oke
dbfca24b99
[MIR] Serialize virtual register flags (#110228)
[MIR] Serialize virtual register flags

This introduces target-specific vreg flag serialization. Flags are represented as `uint8_t` and the `TargetRegisterInfo` override provides methods `getVRegFlagValue` to deserialize and `getVRegFlagsOfReg` to serialize.
2024-10-14 14:19:53 +05:30
Janek van Oirschot
50866e84d1
Revert "[AMDGPU] Avoid resource propagation for recursion through multiple functions" (#112013)
Reverts llvm/llvm-project#111004
2024-10-11 17:10:28 +01:00
Juan Manuel Martinez Caamaño
2d5f3b0a61
[AMDGPU][SIPreEmitPeephole] mustRetainExeczBranch: use BranchProbability and TargetSchedmodel (#109818)
Remove s_cbranch_execnz branches if the transformation is
profitable according to `BranchProbability` and `TargetSchedmodel`.
2024-10-11 17:45:59 +02:00
Janek van Oirschot
67160c5ab5
[AMDGPU] Avoid resource propagation for recursion through multiple functions (#111004)
Avoid constructing recursive MCExpr definitions when multiple functions
cause a recursion.

Fixes #110863
2024-10-11 16:42:50 +01:00
Matt Arsenault
14705a912f
CodeGen: Remove redundant REQUIRES registered-target from tests (#111982)
These are already in target specific test directories.
2024-10-11 16:16:12 +04:00
Petar Avramovic
7b0d56be1d
AMDGPU/GlobalISel: Fix inst-selection of ballot (#109986)
Both input and output of ballot are lane-masks:
result is lane-mask with 'S32/S64 LLT and SGPR bank'
input is lane-mask with 'S1 LLT and VCC reg bank'.
Ballot copies bits from input lane-mask for
all active lanes and puts 0 for inactive lanes.
GlobalISel did not set 0 in result for inactive lanes
for non-constant input.
2024-10-11 11:40:27 +02:00
Fabian Ritter
173c68239d
[AMDGPU] Enable unaligned scratch accesses (#110219)
This allows us to emit wide generic and scratch memory accesses when we
do not have alignment information. In cases where accesses happen to be
properly aligned or where generic accesses do not go to scratch memory,
this improves performance of the generated code by a factor of up to 16x
and reduces code size, especially when lowering memcpy and memmove
intrinsics.

Also: Make the use of the FeatureUnalignedScratchAccess feature more
consistent: FeatureUnalignedScratchAccess and EnableFlatScratch are now
orthogonal, whereas, before, code assumed that the latter implies the
former at some places.

Part of SWDEV-455845.
2024-10-11 08:50:49 +02:00
Jay Foad
62b3a4bc70
[AMDGPU] Improve codegen for s_barrier_init (#111866) 2024-10-10 19:40:02 +01:00
Mikhail Goncharov
8a849a2a56 Revert "Reapply "[AMDGPU][GlobalISel] Fix load/store of pointer vectors, buffer.*.pN (#110714)" v2 (#111708)"
This reverts commit 4b4a0d419c81b8b12a7dbb33dae1f7e9be91a88f.

New test fails on buildbots https://lab.llvm.org/buildbot/#/builders/63/builds/2039 https://lab.llvm.org/buildbot/#/builders/127/builds/1055
2024-10-10 13:37:44 +02:00
Matt Arsenault
c36f902372
AMDGPU/GlobalISel: Insert m0 initialization before sextload/zextload (#111720)
Fixes missing m0 initialize for pre-gfx9 targets with local extending
loads.
2024-10-10 14:01:49 +04:00
Shilei Tian
5a74a4a667
[Attributor] Take the address space from addrspacecast directly (#108258)
Currently `AAAddressSpace` relies on identifying the address spaces of
all underlying objects. However, it might infer sub-optimal address
space when the underlying object is a function argument. In
`AMDGPUPromoteKernelArgumentsPass`, the promotion of a pointer kernel
argument is by adding a series of `addrspacecast` instructions (as shown
below), and hoping `InferAddressSpacePass` can pick it up and do the
rewriting accordingly.

Before promotion:

```
define amdgpu_kernel void @kernel(ptr %to_be_promoted) {
  %val = load i32, ptr %to_be_promoted
  ...
  ret void
}
```

After promotion:

```
define amdgpu_kernel void @kernel(ptr %to_be_promoted) {
  %ptr.cast.0 = addrspace cast ptr % to_be_promoted to ptr addrspace(1)
  %ptr.cast.1 = addrspace cast ptr addrspace(1) %ptr.cast.0 to ptr
  # all the use of %to_be_promoted will use %ptr.cast.1
  %val = load i32, ptr %ptr.cast.1
  ...
  ret void
}
```

When `AAAddressSpace` analyzes the code after promotion, it will take
`%to_be_promoted` as the underlying object of `%ptr.cast.1`, and use its
address space (which is 0) as its final address space, thus simply do
nothing in `manifest`. The attributor framework will them eliminate the
address space cast from 0 to 1 and back to 0, and replace `%ptr.cast.1`
with `%to_be_promoted`, which basically reverts all changes by
`AMDGPUPromoteKernelArgumentsPass`.

IMHO I'm not sure if `AMDGPUPromoteKernelArgumentsPass` promotes the
argument in a proper way. To improve the handling of this case, this PR
adds an extra handling when iterating over all underlying objects. If an
underlying object is a function argument, it means it reaches a terminal
such that we can't futher deduce its underlying object further. In this
case, we check all uses of the argument. If they are all `addrspacecast`
instructions and their destination address spaces are same, we take the
destination address space.

Fixes: SWDEV-482640.
2024-10-09 22:51:07 -04:00
Jeffrey Byrnes
a31d0b2e2b [AMDGPU] Remove some lit check lines
Change-Id: I77e72d23d41095b8fcc47996d8004f9e264968de
2024-10-09 16:12:31 -07:00
Krzysztof Drewniak
4b4a0d419c
Reapply "[AMDGPU][GlobalISel] Fix load/store of pointer vectors, buffer.*.pN (#110714)" v2 (#111708)
This adds `-disable-gisel-legality-check` to some gfx6 and gfx7 test
lines to prevent behavior mismatches between debug and release builds

The first attempted reapply was #111059

This reverts commit e075dcf7d270fd52dc837163ff24e8c872dfeb49.
2024-10-09 17:11:41 -05:00
Matt Arsenault
a075e785b8
AMDGPU: Fix incorrectly selecting fp8/bf8 conversion intrinsics (#107291)
Trying to codegen these on targets without the instructions should
fail to select. Not sure if all the predicates are correct. We had
a fake one disconnected to a feature which was always true.

Fixes: SWDEV-482274
2024-10-09 21:38:47 +04:00
Jeffrey Byrnes
17bc959961
[AMDGPU] Optionally Use GCNRPTrackers during scheduling (#93090)
This adds the ability to use the GCNRPTrackers during scheduling. These trackers have several advantages over the generic trackers: 1. global live-thru trackers, 2. subregister based RP deltas, and 3. flexible vreg -> PressureSet mappings.

This feature is off-by-default to ease with the roll-out process. In particular, when using the optional trackers, the scheduler will still maintain the generic trackers leading to unnecessary compile time.
2024-10-09 09:54:11 -07:00
Matt Arsenault
e85fcb7631
AMDGPU: Add instruction flags when lowering ctor/dtor (#111652)
These should be well behaved address computations.
2024-10-09 18:03:35 +04:00
Matt Arsenault
1e357cde48
AMDGPU: Use pointer types more consistently (#111651)
This was using addrspace 0 and 1 pointers interchangably. This works
out since they happen to use the same size, but consistently query
or use the correct one.
2024-10-09 17:23:50 +04:00
Matt Arsenault
890e481358 AMDGPU: Regenerate test checks 2024-10-09 16:39:15 +04:00
Matt Arsenault
671cbcf642
AMDGPU: Add baseline tests for gep flag handling (#110814)
We need to know the address computation won't overflow on
older subtargets to match the addressing mode of stack instructions.
2024-10-09 13:48:01 +04:00
Matt Arsenault
c198f775cd
AMDGPU: Remove flat/global fmin/fmax intrinsics (#105642)
These have been replaced with atomicrmw
2024-10-09 09:27:28 +04:00
Shilei Tian
88a239d292
[AMDGPU] Adopt new lowering sequence for fdiv16 (#109295)
The current lowering of `fdiv16` can generate incorrectly rounded result
in some cases. The new sequence was provided by the HW team, as shown
below written in C++.


```
half fdiv(half a, half b) {
  float a32 = float(a);
  float b32 = float(b);
  float r32 = 1.0f / b32;
  float q32 = a32 * r32;
  float e32 = -b32 * q32 + a32;
  q32 = e32 * r32 + q32;
  e32 = -b32 * q32 + a32;
  float tmp = e32 * r32;
  uin32_t tmp32 = std::bit_cast<uint32_t>(tmp);
  tmp32 = tmp32 & 0xff800000;
  tmp = std::bit_cast<float>(tmp32);
  q32 = tmp + q32;
  half q16 = half(q32);
  q16 = div_fixup_f16(q16);
  return q16;
}
```

Fixes SWDEV-477608.
2024-10-08 09:49:20 -04:00
Shilei Tian
48ac846fbc
[AMDGPU][GlobalISel] Align selectVOP3PMadMixModsImpl with the SelectionDAG counterpart (#110168)
The current `selectVOP3PMadMixModsImpl` can produce `V_MAD_FIX_F32`
instruction
that violates constant bus restriction, while its `SelectionDAG`
counterpart
doesn't. The culprit is in the copy stripping while the `SelectionDAG`
version
only has a bitcast stripping. This PR simply aligns the two version.
2024-10-08 09:41:24 -04:00
Christudasan Devadasan
6636f32615
[AMDGPU] Include WWM register spill into BB Prolog (#111496)
With #93526 we split the regalloc pipeline further
to have a standalone allocation for wwm registers
and per-lane VGPRs. Currently the presence of the
wwm-spill reloads inserted at the bb-top limits the
isBasicPrologue function during the per-lane vgpr
regalloc to skip past the exec manipulation instruction
and ended up causing incorrect codegen. The wmm-spill
inserted during the wwm-regalloc pipeline should also
be included in the bb-prolog so that the per-lane vgpr
regalloc pipeline can identify the appropriate insertion
points for their spills and copies.
2024-10-08 15:13:12 +05:30
Matt Arsenault
5f94b0cbdd
AMDGPU: Try to reuse dest reg for s_add_i32 frame indexes (#111201)
Hack around the register scavenger doing the wrong thing.
It does not find the result register as available in the
case the frame index add isn't also reading the dest register.
This is the quick fix for a regression where the scavenge would
create a broken spill of SGPR to memory. I believe this is still
broken for cases we cannot use the result register.

I'm confused about what position the scavenger iterator
is supposed to be in, and what RestoreAfter is for. The scavenger
is missing a full set of forward/backward APIs and there seems
to be an off by one somewhere.
2024-10-07 18:01:24 +04:00
Paul Walker
02dd6b1014
[LLVM][CodeGen] Add lowering for scalable vector bfloat operations. (#109803)
Specifically:
  fabs, fadd, fceil, fdiv, ffloor, fma, fmax, fmaxnm, fmin, fminnm,
  fmul, fnearbyint, fneg, frint, fround, froundeven, fsub, fsqrt &
  ftrunc
2024-10-07 13:01:59 +01:00
Pierre van Houtryve
924a64a348
[AMDGPU] Only emit SCOPE_SYS global_wb (#110636)
global_wb with scopes lower than SCOPE_SYS is unnecessary for
correctness.

I was initially optimistic they would be very cheap no-ops but they can
actually be quite expensive so let's avoid them.
2024-10-07 07:35:31 +02:00
Austin Kerbow
c4d89203f3
[AMDGPU] Support preloading hidden kernel arguments (#98861)
Adds hidden kernel arguments to the function signature and marks them
inreg if they should be preloaded into user SGPRs. The normal kernarg
preloading logic then takes over with some additional checks for the
correct implicitarg_ptr alignment.

Special care is needed so that metadata for the hidden arguments is not
added twice when generating the code object.
2024-10-06 17:44:33 -07:00
NAKAMURA Takumi
e075dcf7d2 Revert "Reapply "[AMDGPU][GlobalISel] Fix load/store of pointer vectors, buffer.*.pN (#110714)" (#111059)"
This reverts commit 98a15c7b0c6ec129d371f0c121dbe9396c4f5609.
(llvmorg-20-init-8051-g98a15c7b0c6e)
2024-10-06 10:50:51 +09:00
Yaxun (Sam) Liu
3b88805ca2
[AMDGPU] Fix SDWA commuting (#106920)
SDWA insts miss reverse opcode, which causes them to be treated as
commutable with default reverse opcode i.e. their own opcode. As a
result, SWDA F16 sub A, B and Sub B, A are merged by machine CSE. The
correct behavior is to merged sub A, B and subrev B, A instead of sub B,
A. This issues caused failures in rocFFT tests.

Another issue is that src0_sel and src1_sel are not swapped when SDWA
insts are commuted.

Verified that this fixes rocFFT tests failure.
2024-10-04 15:53:40 -04:00
Krzysztof Drewniak
98a15c7b0c
Reapply "[AMDGPU][GlobalISel] Fix load/store of pointer vectors, buffer.*.pN (#110714)" (#111059)
This reverts commit 650c41aad2eb43c634a05b2b5799a0c13a73b92f.

The test failures appear to be from conflicts with other PRs that landed around this time.
2024-10-04 12:33:26 -05:00
Matt Arsenault
e5a0c30e4a AMDGPU: Work around machine verifier failure with convergence tokens
Apparently any function with convergence tokens will fail the
machine verifier after register allocation. The existing codegen tests
for tokens use stop-before, and do not run to the end. Work around this
by splitting out tests with convergence tokens. Fixes EXPENSIVE_CHECKS
bot failures after c08d7b3de7409aecadd7f9edfe0f3a1ce28a6374 and
428ae0f12e29eff1ddcaf59bdcce904ec056963e
2024-10-04 16:59:23 +04:00
Matt Arsenault
428ae0f12e
AMDGPU: Do not tail call if an inreg argument requires waterfalling (#111002)
If we have a divergent value passed to an outgoing inreg argument,
the call needs to be executed in a waterfall loop and thus cannot
be tail called.

The waterfall handling of arbitrary calls is broken on the selectiondag
path, so some of these cases still hit an error later.

I also noticed the argument evaluation code in isEligibleForTailCallOptimization
is not correctly accounting for implicit argument assignments. It also seems
inreg codegen is generally broken; we are assigning arguments to the reserved
private resource descriptor.
2024-10-04 00:04:02 +04:00
Matt Arsenault
c08d7b3de7
AMDGPU: Fix verifier error on tail call target in vgprs (#110984)
We allow tail calls of known uniform function pointers. This
would produce a verifier error if the uniform value is in VGPRs.
Insert readfirstlanes just in case this occurs, which will fold
out later if it is unnecessary.

GlobalISel should need a similar fix, but it currently does not
attempt tail calls of indirect calls.

Fixes #107447
Fixes subissue of #110930
2024-10-03 21:50:56 +04:00
Juan Manuel Martinez Caamaño
b9bb77f0c7
[AMDGPU] Update branch-condition-and.ll to auto-generated checks (#110860) 2024-10-03 17:12:31 +02:00
vikashgu
870bdc6ea7 Reapply "[AMDGPU]Optimize SGPR spills (#93668)"
This reverts commit c2fc7f75f67039bb1ed577bc0edbd699a850cd9d. As the
dependent patch about split vgpr regalloc pipeline solved the issue(#96353).
2024-10-03 09:47:15 +00:00
NAKAMURA Takumi
650c41aad2 Revert "[AMDGPU][GlobalISel] Fix load/store of pointer vectors, buffer.*.pN (#110714)"
Some builders has been failing tests.
```
Failed Tests (2):
  LLVM :: CodeGen/AMDGPU/GlobalISel/inst-select-load-global-old-legalization.mir
  LLVM :: CodeGen/AMDGPU/GlobalISel/inst-select-load-local.mir
```

This reverts commit ae5bd2a9f292037c605b2ec0ee31200581bd8701.
(llvmorg-20-init-7805-gae5bd2a9f292)
2024-10-03 15:38:34 +09:00
Krzysztof Drewniak
ae5bd2a9f2
[AMDGPU][GlobalISel] Fix load/store of pointer vectors, buffer.*.pN (#110714)
Certain pointer address spaces were not being correctly handled by the
GlobalISel lowering for buffer_load and buffer_store.

1. ptr addrspace(1) and addrspace(4) did not have rewrite patterns
defined for them, while p0 did, since those pointer types weren't in the
list of types that was iterated to form the patterns.
2. Vectors of pointers need to be bitcast to vectors of the
corresponding scalars, since there doesn't seem to be a good way to
define the rewrite patterns for buffer_load/store of those types

The need to bitcast vectors of pointers was also revealed to affect
ordinary `G_LOAD` and `G_STORE` in some cases, so
`shouldBitcastLoadStore()` has been fixed to handle it properly.
2024-10-02 13:46:56 -05:00
Matt Arsenault
f2eeb3dc7b
AMDGPU: Handle v_add* in eliminateFrameIndex (#102346) 2024-10-02 21:19:45 +04:00
Matt Arsenault
4cd1f9ac9f
AMDGPU: Add baseline test for frame index folding (#110737)
We currently can increase the instruction count
when a frame index requires materialization.
2024-10-02 19:47:32 +04:00
Matt Arsenault
187dcd8e22
DAG: Preserve disjoint flag when emitting final instructions (#110795) 2024-10-02 19:37:04 +04:00
Janek van Oirschot
e35319524a
[AMDGPU] Fix stack size metadata for functions with direct and indirect calls (#110828)
When a function has an external call, it should still use the stack
sizes of direct, known, calls to calculate its own stack size
2024-10-02 14:52:52 +01:00
Matt Arsenault
30ea7e859b AMDGPU: Preserve flags in regbankselect when splitting or 2024-10-02 12:08:21 +04:00
Matt Arsenault
203d5158c3 AMDGPU/GlobalISel: Preserve flags when splitting select
RegBankSelect was losing flags on selects.
2024-10-02 12:08:17 +04:00
Matt Arsenault
2d3119c3d9 AMDGPU: Add more tests for frame index code quality
There are also some bugs with sgpr constraints.
2024-10-01 21:42:34 +04:00
Matt Arsenault
f61abee01a AMDGPU: Add missing tests for local stack alloc s_add_i32 handling
None of these tested the case where the non-frame index operand
was a register.
2024-10-01 19:49:18 +04:00
Fabian Ritter
16ba126a14
[AMDGPU][GlobalISel][NFC] Use amdhsa target for flat/private tests (#110672)
As a proxy criterion, mesa targets have unaligned-access-mode (which
determines whether the hardware allows unaligned memory accesses) not
set whereas amdhsa targets do. This PR changes tests to use amdhsa
instead of mesa and inserts additional checks with unaligned-access-mode
unset explicitly.

This is in preparation for PR #110219, which will generate different
code depending on the unaligned-access-mode.
2024-10-01 17:04:18 +02:00
Matt Arsenault
8395b3f60f AMDGPU: Mark scc dead when materialized frame base registers 2024-10-01 18:54:18 +04:00
Brox Chen
2672037e36
[AMDGPU][True16][MC] Support VOP3 only instructions with true16 and fake16 (#109891)
Update VOP3 only instructions with true16 and fake16 formats. 

This patch includes instructions:
V_MUL_LO_U16
V_MAX_U16
V_MAX_I16
V_MIN_U16
V_MIN_I16
V_LSHLREV_B16
V_LSHRREV_B16
V_ASHRREV_I16
2024-10-01 09:25:36 -04:00
Fabian Ritter
3ba4092c06
[AMDGPU] Check vector sizes for physical register constraints in inline asm (#109955)
For register constraints that require specific register ranges, the
width of the range should match the type of the associated
parameter/return value. With this PR, we error out when that is not the
case. Previously, these cases would hit assertions or llvm_unreachables.

The handling of register constraints that require only a single register
remains more lenient to allow narrower non-vector types for the
associated IR values. For example, constraining an i16 or i8 value to a
32-bit register is still allowed.

Fixes #101190.

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2024-10-01 10:29:35 +02:00