7909 Commits

Author SHA1 Message Date
Krzysztof Drewniak
3b0f506c87
[AMDGPU] Support nuw and nusw in buffer fat pointer lowering (#115039)
This commit uses the `nuw` flag on `getelementptr` to set the `nuw` flag
on buffer offset additions, and also moves from `inbounds` to the looser
`nusw` for the existing case.
2024-11-06 11:42:47 -06:00
Matt Arsenault
aa7941289e
AMDGPU: Fold copy of scalar add of frame index (#115058)
This is a pre-optimization to avoid a regression in a future
commit. Currently we almost always emit frame index with
a v_mov_b32 and use vector adds for the pointer operations. We
need to consider the users of the frame index (or rather, the
transitive users of derived pointer operations) to know whether
the value will be used in a vector or scalar context. This saves
an sgpr->vgpr copy.

This optimization could be more general for any opcode that's
trivially convertible from a scalar to vector form (although this
is a workaround for a proper regbankselect).
2024-11-06 09:10:58 -08:00
Matt Arsenault
efe87fbc9d
AMDGPU: Improve vector of pointer handling in amdgpu-promote-alloca (#114144) 2024-11-06 08:47:15 -08:00
Paul Walker
38fffa630e
[LLVM][IR] Use splat syntax when printing Constant[Data]Vector. (#112548) 2024-11-06 11:53:33 +00:00
Stanislav Mekhanoshin
6d7e51de5e
[AMDGPU] Extend type support for update_dpp intrinsic (#114597)
We can split 64-bit DPP as a post-RA pseudo if control values are
supported, but cannot handle other types.
2024-11-05 13:59:14 -08:00
Brox Chen
e8644e3b47
[AMDGPU][True16][MC] VOP2 update instructions with fake16 format (#114436)
Some old "t16" VOP2 instructions are actually in fake16 format. Correct
them and update the test file.
2024-11-05 16:12:49 -05:00
Matt Arsenault
0b40f97929
AMDGPU: Treat uint32_max as the default value for amdgpu-max-num-workgroups (#113751)
0 does not make sense as a value for this attribute, much less as the default.
Also stop emitting each individual field if it is the default, rather than
if any element was the default. Also fix the name of the test since it didn't
exactly match the real attribute name.
2024-11-05 12:50:44 -08:00
Matt Arsenault
ce067c5a3b AMDGPU: Rename test file 2024-11-05 10:42:12 -08:00
Matt Arsenault
75d673718a
AMDGPU: Fix clobbering temp reg for large frame indexes in VOP3 users (#114924)
For a VOP3 instruction that does not permit a literal operand with an
SGPR operand, this would re-use the same scratch register for both
operands, clobbering the original value.
2024-11-05 07:37:23 -08:00
Craig Topper
999dfb2067
[GISel][AArch64][AMDGPU][RISCV] Canonicalize (sub X, C) -> (add X, -C) (#114309)
This matches InstCombine and DAGCombine.

RISC-V only has an ADDI instruction so without this we need additional
patterns to do the conversion.

Some of the AMDGPU tests look like possible regressions. Maybe some
patterns from isel aren't imported.
2024-11-04 17:20:11 -08:00
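The canonicalization in #114309 relies on `(sub X, C)` and `(add X, -C)` producing the same result under wrapping integer arithmetic. A minimal sketch of that equivalence for 32-bit values follows; the helper names are hypothetical and only model the arithmetic, not the GlobalISel combine itself.

```python
MASK32 = 0xFFFFFFFF  # model 32-bit wrapping integer arithmetic


def sub_const(x: int, c: int) -> int:
    """(sub X, C) on 32-bit values."""
    return (x - c) & MASK32


def add_negated_const(x: int, c: int) -> int:
    """(add X, -C), with -C materialized as a 32-bit two's complement constant."""
    neg_c = (-c) & MASK32
    return (x + neg_c) & MASK32


# The two forms agree for every pair tried here, including the edge cases
# where C is 0, the sign bit, or all ones.
for x in (0, 1, 5, 0x7FFFFFFF, 0x80000000, MASK32):
    for c in (0, 1, 42, 0x80000000, MASK32):
        assert sub_const(x, c) == add_negated_const(x, c)
```

Because the negated constant is known at compile time, a target that only has an add-immediate form can use it directly, which is the motivation the message gives for RISC-V.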
Matt Arsenault
002a0a27bc
AMDGPU: Fix broken frame index expansion for v_add_co_u32_e64 (#114634)
With an explicit carry out operand, one too many operands were deleted
resulting in a malformed v_mov_b32.
2024-11-04 10:39:50 -08:00
Matt Arsenault
30dd1297fa
AMDGPU: Custom expand flat cmpxchg which may access private (#109410)
64-bit flat cmpxchg instructions do not work correctly for scratch
addresses, and need to be expanded as non-atomic.

Allow custom expansion of cmpxchg in AtomicExpand, as is
already the case for atomicrmw.
2024-11-04 09:29:38 -08:00
Jay Foad
f8559751fc
[llvm-project] Fix typo "propogate" (#114795) 2024-11-04 15:33:19 +00:00
Shilei Tian
390300d9f4
[PassBuilder] Add ThinOrFullLTOPhase to optimizer pipeline (#114577) 2024-11-03 23:25:29 -05:00
Shilei Tian
dc45ff1d2a
[PassBuilder] Add ThinOrFullLTOPhase to early simplification EP call backs (#114547)
The early simplification pipeline is used in the non-LTO and
(Thin/Full)LTO pre-link stages. There are some passes that we want to
run in non-LTO mode but not at the LTO pre-link stage. That control is
currently missing; this PR adds it. To demonstrate the use, we only
enable the internalization pass in non-LTO mode for AMDGPU, because
running it in the pre-link stage causes some issues.
2024-11-03 23:24:10 -05:00
Matt Arsenault
8e61aaa021
AMDGPU: Fix illegal commute with frame index (#114497)
In ca409892c5396fa3fbb8ea4dbf53d0e952f36d09, frame indexes started
being treated more like registers, rather than immediates. Update
the commute logic to avoid failing the verifier by moving illegal
SGPR operands in place of a frame index.
2024-11-01 10:02:29 -07:00
Krzysztof Drewniak
ea33af63de
Reapply "[AMDGPU][GlobalISel] Fix load/store of pointer vectors, buffer.*.pN (#110714)" v3 (#114443)
This reverts commit 8a849a2a567d4e519b246a16936b6e7519936d4b.

It seems I missed a spot when trying to ensure the code in the
instruction selection tests was actually legalized MIR.
2024-11-01 11:13:29 -05:00
Ruiling, Song
54d31bde32
Reapply "StructurizeCFG: Optimize phi insertion during ssa reconstruction (#101301)" (#114347)
This reverts commit be40c723ce2b7bf2690d22039d74d21b2bd5b7cf.
2024-11-01 08:29:59 +08:00
Matt Arsenault
e3222e6f80
AMDGPU: Add baseline tests for cmpxchg custom expansion (#109408)
We need a non-atomic path if flat may access private.
2024-10-31 11:46:13 -07:00
Simon Pilgrim
9fb4bc5bf4
[DAG] SimplifyMultipleUseDemandedBits - ignore SRL node if we're just demanding known sign bits (#114389)
Check to see if we are only demanding (shifted) signbits from a SRL node that are also signbits in the source node.

We can't demand any upper zero bits that the SRL will shift in (up to max shift amount), and the lower demanded bits bound must already be all signbits.
2024-10-31 16:40:29 +00:00
Matt Arsenault
1d0370872f
AMDGPU: Expand flat atomics that may access private memory (#109407)
If the runtime flat address resolves to a scratch address,
64-bit atomics do not work correctly. Insert a runtime address
space check (which is quite likely to be uniform) and select between
the non-atomic and real atomic cases.

Consider noalias.addrspace metadata and avoid this expansion when
possible (we also need to consider it to avoid infinitely expanding
after adding the predication code).
2024-10-31 08:08:48 -07:00
Matt Arsenault
db5bcb24c2
GlobalISel: Fix combine duplicating atomic loads (#111730)
The sext_inreg (load) combine was not deleting the old load instruction,
and it would never be deleted if volatile or atomic.
2024-10-31 07:55:12 -07:00
Matt Arsenault
12409024d3
AMDGPU/GlobalISel: Handle atomic sextload and zextload (#111721)
Atomic loads are handled differently from the DAG, and have separate opcodes
and explicit control over the extensions, like ordinary loads. Add
new patterns for these.

There's room for cleanup and improvement. d16 cases aren't handled.

Fixes #111645
2024-10-31 07:44:52 -07:00
Stanislav Mekhanoshin
7cd29741fa
[AMDGPU] Extend mov_dpp8 intrinsic lowering for generic types (#114296)
The int_amdgcn_mov_dpp8 is overloaded, but we can only select i32.
To allow a corresponding builtin to be overloaded the same way as
int_amdgcn_mov_dpp we need it to be able to split unsupported values.
2024-10-31 01:15:25 -07:00
Changpeng Fang
ca1154d1d4
AMDGPU: Disable pattern matching "x<<32-y>>32-y" to "bfe x, 0, y" (#114279)
It is not correct to lower "x<<32-y>>32-y" to "bfe x, 0, y". When y
equals 32, the left-hand side is still x (unchanged), but the
right-hand side evaluates to 0, so the transformation is not always
correct.

We may be able to keep the pattern for an immediate y within [0,
31]. However, the immediate operands of the sub (32 - y) are easily
folded, and "(x << imm) >> imm" will be lowered to "and x,
(2^(32-imm))-1" anyway. So no bfe matching is needed.
2024-10-30 11:07:15 -07:00
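The mismatch described in #114279 can be reproduced with a small model. This is a sketch under one stated assumption: the `bfe_model` helper below (a hypothetical name) only uses the low 5 bits of its width operand, which is what makes a width of 32 behave like a width of 0, as the commit message reports. It is not a definitive description of the hardware instruction.

```python
MASK32 = 0xFFFFFFFF


def shl_lshr(x: int, y: int) -> int:
    """(x << (32 - y)) >> (32 - y) on 32-bit unsigned values, for 1 <= y <= 32."""
    s = 32 - y
    return ((x << s) & MASK32) >> s


def bfe_model(x: int, offset: int, width: int) -> int:
    """Unsigned bitfield extract.

    Assumption: offset and width are each truncated to 5 bits, so a width
    of 32 is read as 0 and the result is 0.
    """
    offset &= 31
    width &= 31
    return (x >> offset) & ((1 << width) - 1)


x = 0xDEADBEEF

# For y in [1, 31] the two forms agree.
for y in range(1, 32):
    assert shl_lshr(x, y) == bfe_model(x, 0, y)

# For y == 32 they do not: the shift pair leaves x unchanged, the bfe yields 0.
assert shl_lshr(x, 32) == x
assert bfe_model(x, 0, 32) == 0

# The immediate case folds to a plain mask: (x << imm) >> imm == x & (2^(32-imm) - 1).
for imm in range(32):
    assert ((x << imm) & MASK32) >> imm == x & ((1 << (32 - imm)) - 1)
```

The last loop mirrors the second paragraph of the message: once the shift amount is an immediate, the shift pair is already equivalent to an `and` with a constant mask, so nothing is lost by dropping the bfe pattern.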
Jay Foad
311c0772f9 [AMDGPU] Fix test failures after #114232 and #114200 2024-10-30 16:51:44 +00:00
Jay Foad
6bf4476ffb
[AMDGPU] Fix @llvm.amdgcn.cs.chain with callee not provably uniform (#114200)
The correct behavior is to insert a readfirstlane. This worked except
for an inappropriate assertion in SITargetLowering::LowerCall.
2024-10-30 16:18:29 +00:00
Jay Foad
8ee5e19c87
[AMDGPU] Fix @llvm.amdgcn.cs.chain with SGPR args not provably uniform (#114232)
The correct behaviour is to insert a readfirstlane. SelectionDAG was
already doing this in some cases, but not in the general case for chain
calls. GlobalISel was already doing this for return values but not for
arguments.
2024-10-30 16:12:37 +00:00
Akshat Oke
44d0e9522a
[CodeGen][NewPM] Port TailDuplicate pass to NPM (#113293) 2024-10-30 11:48:40 +05:30
Matt Arsenault
6d9fc1b846
AMDGPU: Fix producing invalid IR on vector typed getelementptr (#114113)
This did not consider the IR change to allow a scalar base with a vector
offset part. Reject any users that are not explicitly handled.

In this situation we could handle the vector GEP, but that is a larger
change. This just avoids the IR verifier error by rejecting it.
2024-10-29 22:14:24 -07:00
Jay Foad
a156362e93
[AMDGPU] Fix machine verification failure after SIFoldOperandsImpl::tryFoldOMod (#113544)
Fixes #54201
2024-10-29 14:59:37 +00:00
Matt Arsenault
88e23eb2cf
DAG: Fix legalization of vector addrspacecasts (#113964) 2024-10-29 08:08:50 -05:00
Matt Arsenault
1ceccbb0dd
VirtRegRewriter: Add implicit register defs for live out undef lanes (#112679)
If an undef subregister def is live into another block, we need to
maintain a physreg def to track the liveness of those lanes. This
would manifest as a verifier error after branch folding, when the cloned
tail block use no longer had a def.

We need to detect interference with other assigned intervals to avoid
clobbering the undef lanes defined in other intervals, since the undef
def didn't count as interference. This is pretty ugly and adds a new
dependency on LiveRegMatrix, keeping it live for one more pass. It also
adds a lot of implicit operand spam (we really should have a better
representation for this).

There is a missing verifier check for this situation. Added an xfailed
test that demonstrates this. We may also be able to revert the changes
in 47d3cbcf842a036c20c3f1c74255cdfc213f41c2.

It might be better to insert an IMPLICIT_DEF before the instruction
rather than using the implicit-def operand.

Fixes #98474
2024-10-28 17:33:53 -07:00
Fabian Ritter
a4fd3dba6e
[AMDGPU] Use wider loop lowering type for LowerMemIntrinsics (#112332)
When llvm.memcpy or llvm.memmove intrinsics are lowered as a loop in
LowerMemIntrinsics.cpp, the loop consists of a single load/store pair
per iteration. We can improve performance in some cases by emitting
multiple load/store pairs per iteration. This patch achieves that by
increasing the width of the loop lowering type in the GCN target and
letting legalization split the resulting too-wide access pairs into
multiple legal access pairs.

This change only affects lowered memcpys and memmoves with large (>=
1024 bytes) constant lengths. Smaller constant lengths are handled by
ISel directly; non-constant lengths would be slowed down by this change
if the dynamic length was smaller or slightly larger than what an
unrolled iteration copies.

The chosen default unroll factor is the result of microbenchmarks on
gfx1030. This change leads to speedups of 15-38% for global memory and
1.9-5.8x for scratch in these microbenchmarks.

Part of SWDEV-455845.
2024-10-28 09:04:19 +01:00
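The trade-off described in #112332 comes down to simple arithmetic on the copy length. The sketch below is only illustrative: the function name and the byte widths (a 16-byte legal access, a 64-byte loop lowering type) are hypothetical values chosen for the example, not the ones the patch selects.

```python
def loop_shape(length: int, loop_type_bytes: int, legal_access_bytes: int):
    """Return (iterations, load/store pairs per iteration, residual bytes).

    Models a copy loop whose body moves one value of the loop lowering
    type, which legalization then splits into legal-width accesses.
    """
    iterations = length // loop_type_bytes
    pairs_per_iteration = loop_type_bytes // legal_access_bytes
    residual = length % loop_type_bytes
    return iterations, pairs_per_iteration, residual


# Hypothetical widths: a 16-byte legal access, widened to a 64-byte loop type.
assert loop_shape(1024, 16, 16) == (64, 1, 0)  # one pair per iteration
assert loop_shape(1024, 64, 16) == (16, 4, 0)  # four pairs per iteration

# A length below one widened iteration never enters the loop at all;
# everything falls to the residual handling.
assert loop_shape(48, 64, 16) == (0, 4, 48)
```

The last case hints at why the message restricts the change to large constant lengths: when the length is small relative to one widened iteration, the wide loop body does no useful work.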
Gaëtan Bossu
a0c318938a
[CodeGen][NFC] Properly split MachineLICM and EarlyMachineLICM (#113573)
Both are based on MachineLICMBase, and the functionality there is
"switched" based on a PreRegAlloc flag. This commit is simply about
trusting the original value of that flag, defined by the `MachineLICM`
and `EarlyMachineLICM` classes.

The `PreRegAlloc` flag used to be overwritten based on MRI.isSSA(),
which is unreliable due to how it is inferred by the MIRParser. I see
that we can now define isSSA in MIR (thanks @gargaroff), meaning the
fix isn’t really needed anymore, but redefining that flag still feels
wrong.

Note that I'm looking into upstreaming more changes to MachineLICM, see
[the discourse thread](https://discourse.llvm.org/t/extending-post-regalloc-machinelicm/82725).
2024-10-25 11:19:22 -07:00
Jun Wang
19b0453361
[AMDGPU][MC] Fix disassembler problem for image_atomic with TFE (#112622)
For image_atomic instructions with TFE, in some cases (e.g., when
dmask=3) the disassembler produces a dst register with the wrong size
(e.g., image_atomic_smin v5, v1, s[8:15] dmask:0x3 tfe, instead of v[5:7]).
This patch fixes the VDataDwords values for image atomic instructions.
2024-10-24 16:19:18 -07:00
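The example in #112622 (dmask=3 with TFE should print `v[5:7]`) is consistent with a destination width of one dword per set dmask bit plus one extra dword for TFE. The sketch below encodes that rule as an assumption inferred from the example; the function names are hypothetical and this is not the patch's actual VDataDwords computation.

```python
def vdata_dwords(dmask: int, tfe: bool) -> int:
    """Assumed rule: one dword per set dmask bit, plus one dword when TFE is set."""
    return bin(dmask & 0xF).count("1") + (1 if tfe else 0)


def vreg_operand(first: int, dwords: int) -> str:
    """Format a VGPR operand the way the assembly syntax in the message does."""
    if dwords == 1:
        return f"v{first}"
    return f"v[{first}:{first + dwords - 1}]"


# dmask=0x3 with tfe needs three dwords, so the operand starting at v5 is v[5:7].
assert vdata_dwords(0x3, tfe=True) == 3
assert vreg_operand(5, vdata_dwords(0x3, tfe=True)) == "v[5:7]"
```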
Carl Ritson
076aac59ac
[AMDGPU] Add a new target for gfx1153 (#113138) 2024-10-23 12:56:58 +09:00
Janek van Oirschot
a18826d75c
[AMDGPU] Create local KnownBits in case DenseMap gets invalidated (#111568)
KnownBits retrieved from the DenseMap may be invalidated if an insertion
requires a (re)growth.

Fixes https://github.com/llvm/llvm-project/issues/110930
2024-10-22 16:05:07 +01:00
Fabian Ritter
4c697f7037
[LowerMemIntrinsics] Use i8 GEPs in memcpy/memmove lowering (#112707)
The IR lowering of memcpy/memmove intrinsics uses a target-specific type
for its load/store operations. So far, the loaded and stored addresses
are computed with GEPs based on this type. That is wrong if the
allocation size of the type differs from its store size: The width of
the accesses is determined by the store size, while the GEP stride is
determined by the allocation size. If the allocation size is greater
than the store size, some bytes are not copied/moved.

This patch changes the GEPs to use i8 addressing, with offsets based on
the type's store size. The correctness of the lowering therefore no
longer depends on the type's allocation size.

This is in support of PR #112332, which allows adjusting the memcpy loop
lowering type through a command line argument in the AMDGPU backend.
2024-10-22 16:48:50 +02:00
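The bug fixed in #112707 is easy to see by tracking which byte offsets a copy loop touches. The following is a minimal sketch with hypothetical sizes (a type whose store size is 12 bytes and whose allocation size is 16 bytes); it models only the addressing, not the actual IR lowering.

```python
def touched_offsets(iterations: int, store_size: int, stride: int) -> set:
    """Byte offsets accessed by a loop that moves store_size bytes per
    iteration and advances the address by stride bytes each time."""
    touched = set()
    for i in range(iterations):
        base = i * stride
        touched.update(range(base, base + store_size))
    return touched


STORE_SIZE = 12  # hypothetical: bytes actually loaded/stored per access
ALLOC_SIZE = 16  # hypothetical: stride of a GEP over the same type
ITERATIONS = 2

# Typed GEPs: the stride follows the allocation size, leaving holes.
typed = touched_offsets(ITERATIONS, STORE_SIZE, ALLOC_SIZE)
# i8 GEPs with offsets based on the store size: contiguous coverage.
byte_addressed = touched_offsets(ITERATIONS, STORE_SIZE, STORE_SIZE)

expected = set(range(ITERATIONS * STORE_SIZE))
skipped = sorted(expected - typed)

assert skipped == [12, 13, 14, 15]   # bytes the typed-GEP loop never copies
assert byte_addressed == expected    # i8 addressing covers every byte
```

With the stride tied to the store size, the result no longer depends on the allocation size of whatever type the target picks, which matches the message's stated goal.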
Akshat Oke
ca32bd643b
[NewPM][AMDGPU] Port SIPreAllocateWWMRegs to NPM (#109939) 2024-10-22 15:37:08 +05:30
Akshat Oke
f8cb526076
[AMDGPU] Add tests for SIPreAllocateWWMRegs (#109963) 2024-10-22 15:33:46 +05:30
Fabian Ritter
69abfd3141
[AMDGPU] Allow casts between the Global and Constant Addr Spaces in isValidAddrSpaceCast (#112493)
So far, isValidAddrSpaceCast only allows casts to the flat address
space and between the constant(32) address spaces. It does not allow
casting between the global and constant address spaces, even though they
alias. That affects, e.g., the lowering of memmoves from the constant to
the global address space in LowerMemIntrinsics, since that requires
aliasing address spaces to be castable.

This patch relaxes isValidAddrSpaceCast and allows such casts. It also
includes a memmove test that would crash with the previous
implementation because the memmove IR lowering would not be
applicable for the move from constant AS to global AS.
2024-10-22 09:33:21 +02:00
Stanislav Mekhanoshin
3277c7cd28
[AMDGPU] Skip VGPR deallocation for waveslot limited kernels (#112765)
MSG_DEALLOC_VGPRS slows down very small waveslot limited kernels. It
has been identified that this message is only really needed for VGPR
limited kernels. A kernel becomes VGPR limited if the total number of
VGPRs per SIMD / the number of used VGPRs is more than the number of
wave slots.
2024-10-21 09:39:52 -07:00
Christudasan Devadasan
3c5cea650d
[AMDGPU]: Add implicit-def to the BB prolog (#112872)
An IMPLICIT_DEF inserted for a wwm-register at the very first block,
or at the predecessor block where it is used for sgpr spilling, can
appear at the beginning of a block that requires spill insertion
during the per-lane VGPR regalloc phase. The presence of the
IMPLICIT_DEF currently breaks the BB prolog.

Fixes: SWDEV-490717
2024-10-21 13:21:16 +05:30
Matt Arsenault
ef91cd3f01
AMDGPU: Handle folding frame indexes into add with immediate (#110738) 2024-10-19 12:33:03 -07:00
Alex Rønne Petersen
ad4a582fd9
[llvm] Consistently respect naked fn attribute in TargetFrameLowering::hasFP() (#106014)
Some targets (e.g. PPC and Hexagon) already did this. I think it's best
to do this consistently so that frontend authors don't run into
inconsistent results when they emit `naked` functions. For example, in
Zig, we had to change our emit code to also set `frame-pointer=none` to
get reliable results across targets.

Note: I don't have commit access.
2024-10-18 09:35:42 +04:00
Shilei Tian
92663defb1
[NFC][AMDGPU] Auto-generate check lines for some test cases (#112426)
- `llvm/test/CodeGen/AMDGPU/andorbitset.ll`
- `llvm/test/CodeGen/AMDGPU/andorxorinvimm.ll`
- `llvm/test/CodeGen/AMDGPU/fabs.f64.ll`
- `llvm/test/CodeGen/AMDGPU/llvm.amdgcn.raw.ptr.buffer.store.ll`
- `llvm/test/CodeGen/AMDGPU/s_mulk_i32.ll`
2024-10-17 10:55:29 -04:00
Brox Chen
35e937b4de
[AMDGPU][True16][CodeGen] fp conversion in true/fake16 format (#101678)
The true16 format of the fp conversion V_CVT_F_F/V_CVT_F_U instructions
was previously implemented using the fake16 profile.

With the MC support in place, correct and support these instructions in
true16/fake16 format in CodeGen.
2024-10-16 12:26:01 -04:00
Brox Chen
7b4c8b35d4
[AMDGPU][True16][MC] VOP3 profile in True16 format (#109031)
Modify VOP3 profile and pseudo, and add encoding info for VOP3 True16
including DPP and DPP8 in true16 and fake16 format.

This patch applies true16/fake16 changes and asm/dasm changes to
V_ADD_NC_U16
V_ADD_NC_I16
V_SUB_NC_U16
V_SUB_NC_I16
2024-10-16 10:27:44 -04:00
Petar Avramovic
14d006c53c
AMDGPU/GlobalISel: Run redundant_and combine in RegBankCombiner (#112353)
The combine is needed to clear redundant ANDs with 1 that will be
created by reg-bank-select to clean up high bits in a register.
Fix replaceRegWith from CombinerHelper:
If a copy has to be inserted, first create the copy, then delete MI.
If MI is deleted first, the insert point is not valid.
2024-10-16 09:43:16 +02:00