1495 Commits

Author SHA1 Message Date
yingopq
754ed95b66
[Mips] Fix compiler crash when returning fp128 after calling a function returning { i8, i128 } (#117525)

Fixes https://github.com/llvm/llvm-project/issues/96432.
2025-01-20 16:47:40 +08:00
Kazu Hirata
bfb6bb69fd [AMDGPU] Fix a warning
This patch fixes:

  llvm/lib/Target/AMDGPU/SIISelLowering.cpp:13908:46: error:
  comparison of integers of different signs: 'uint32_t' (aka 'unsigned
  int') and 'int' [-Werror,-Wsign-compare]
2025-01-16 22:40:08 -08:00
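For reference, a minimal standalone sketch of the kind of mismatch -Wsign-compare flags and the usual shape of the fix (not the actual SIISelLowering code; the function and names are illustrative only):

```
#include <cstdint>

// Comparing uint32_t against int warns under -Wsign-compare; the usual fix is
// to cast one side after ruling out negative values where that matters.
bool sameWidth(uint32_t BitWidth, int Expected) {
  // return BitWidth == Expected;  // warns under -Wsign-compare
  return Expected >= 0 && BitWidth == static_cast<uint32_t>(Expected);
}
```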
Vikram Hegde
225fc4f356
[AMDGPU][SDAG] Try folding "lshr i64 + mad" to "mad_u64_u32" (#119218)
The intention is to use a "copy" instead of a "sub" to handle the high
parts of 64-bit multiply for this specific case.

This unlocks copy prop use cases where the copy can be reused by later
multiply+add sequences if possible.

Fixes: SWDEV-487672, SWDEV-487669
2025-01-17 11:09:39 +05:30
Matt Arsenault
ca95519704
AMDGPU: Implement isExtractVecEltCheap (#122460)
Once again we have excessive TLI hooks with bad defaults. Permit this
for 32-bit element vectors, where extracting an element just means using a different register.
We should permit 16-bit vectors as cheap with legal packed instructions,
but I see some mixed improvements and regressions that need investigation.
2025-01-17 08:38:01 +07:00
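A hedged sketch of what the override described above could look like; the exact SITargetLowering implementation may differ, and this fragment assumes the surrounding LLVM declarations:

```
// Allow "cheap" extraction only for 32-bit element vectors, per the commit
// message: reading a 32-bit element is just using a different register.
bool SITargetLowering::isExtractVecEltCheap(EVT VT, unsigned Index) const {
  return VT.getScalarSizeInBits() == 32;
}
```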
Acim Maravic
cc3aab580b
[AMDGPU] Handle nontemporal and amdgpu.last.use metadata in amdgpu-lower-buffer-fat-pointers (#120139) 2025-01-14 11:22:20 +01:00
Mirko Brkušanin
3def49cb64
[AMDGPU] Remove s_wakeup_barrier instruction (#122277) 2025-01-10 11:30:22 +01:00
Matt Arsenault
46ca6dfb5f
AMDGPU: Add disjoint to or produced from lowering vector ops (#122424) 2025-01-10 16:21:53 +07:00
Jay Foad
fd922c4b4f
[CodeGen] Add const to getAddrModeArguments argument. NFC. (#122335) 2025-01-10 09:19:25 +00:00
Chinmay Deshpande
211bcf67aa
[AMDGPU] Implement IR variant of isFMAFasterThanFMulAndFAdd (#121465) 2025-01-10 09:05:41 +05:30
Matt Arsenault
d2b78c646b
AMDGPU: Custom lower bf16 shuffles (#122252)
We already custom lower the other 16-bit element type shuffles.
2025-01-09 21:37:27 +07:00
Matt Arsenault
09583dec15
AMDGPU: Reduce 64-bit add width if low bits are known 0 (#122049)
If one of the inputs has all 0 bits, the low part cannot
carry and we can just pass through the original value.

Add case: https://alive2.llvm.org/ce/z/TNc7hf
Sub case: https://alive2.llvm.org/ce/z/AjH2-J

We could do this in the general case with computeKnownBits,
but add is so common this could be potentially expensive for
something which will fire infrequently.

One potential concern is this could break the 64-bit add
we expect to see for addressing mode matching, but these
constants shouldn't appear often in addressing expressions.
One test for large offset expressions changes but isn't worse.

Fixes https://github.com/ROCm/llvm-project/issues/237
2025-01-08 22:33:54 +07:00
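A standalone model of the carry argument above (plain C++, not the DAG combine itself):

```
#include <cassert>
#include <cstdint>

// If the low 32 bits of A are all zero, the low halves cannot carry, so the
// low 32 bits of A + B are just B's low half and the high half is a 32-bit add.
uint64_t add64WithKnownZeroLow(uint64_t A, uint64_t B) {
  assert((A & 0xffffffffULL) == 0 && "low bits must be known zero");
  uint32_t Lo = static_cast<uint32_t>(B); // passes through unchanged
  uint32_t Hi = static_cast<uint32_t>(A >> 32) + static_cast<uint32_t>(B >> 32);
  return (static_cast<uint64_t>(Hi) << 32) | Lo; // equals A + B (mod 2^64)
}
```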
Aaditya
0bd1c87996
[AMDGPU] Support divergent sized dynamic alloca (#121148)
Currently, the AMDGPU backend can handle uniform-sized dynamic allocas.
This patch extends support for divergent-sized dynamic allocas.
When the size argument of a dynamic alloca is divergent, 
a wave-wide reduction is performed to get the required stack space. 
`@llvm.amdgcn.wave.reduce.umax` is used to perform the 
wave reduction.

Dynamic allocas are not completely supported yet, 
as the stack is not properly restored on function exit.
This patch doesn't attempt to address the aforementioned issue.

Note: the compiler already zero-extends or truncates all other
types of the alloca size argument to i32.
2025-01-06 12:28:24 +07:00
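A conceptual model of the wave-wide reduction mentioned above (plain C++; the real lowering uses @llvm.amdgcn.wave.reduce.umax on the hardware lanes, not a vector):

```
#include <algorithm>
#include <cstdint>
#include <vector>

// Conceptual stand-in only: each active lane contributes its divergent alloca
// size, and the wave reserves stack space based on the maximum of those requests.
uint32_t waveReduceUMax(const std::vector<uint32_t> &ActiveLaneSizes) {
  uint32_t Max = 0;
  for (uint32_t Size : ActiveLaneSizes)
    Max = std::max(Max, Size);
  return Max;
}
```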
Aaditya
c7606710f9
[AMDGPU] Update base addr of dyn alloca considering GrowingUp stack (#119822)
Currently, the compiler calculates the base address of a
dynamically sized stack object (alloca) as follows:
1. `NewSP = Align(CurrSP + Size)`
_where_ `Size = # of elements * wave size * alloca type`
2. `BaseAddr = NewSP`
3. The alignment is computed as: `AlignedAddr = Addr & ~(Alignment - 1)`
4. Return the `BaseAddr`
This makes sense when the stack grows downwards.

The AMDGPU stack grows upwards, so the base address
needs to be aligned first and the SP bumped by the required size afterwards:
1. `BaseAddr = Align(CurrSP)`
2. `NewSP = BaseAddr + Size`
3. `AlignedAddr = (Addr + (Alignment - 1)) & ~(Alignment - 1)`
4. Return the `BaseAddr`.
2024-12-20 10:27:27 +05:30
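The two alignment formulas from the steps above, as a small standalone sketch:

```
#include <cassert>
#include <cstdint>

// Align-down: the usual form for a downward-growing stack (old step 3).
uint64_t alignDown(uint64_t Addr, uint64_t Alignment) {
  assert(Alignment && (Alignment & (Alignment - 1)) == 0 && "power of two");
  return Addr & ~(Alignment - 1);
}

// Align-up: what an upward-growing stack needs (new step 3), rounding to the
// next aligned address at or above the current SP.
uint64_t alignUp(uint64_t Addr, uint64_t Alignment) {
  assert(Alignment && (Alignment & (Alignment - 1)) == 0 && "power of two");
  return (Addr + Alignment - 1) & ~(Alignment - 1);
}
```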
Craig Topper
f139bde8d8
[SelectionDAG] Move SDNode::use_iterator::getOperandNo to SDUse. (#120536)
This allows us to write more range based for loops because we no
longer need the iterator. It also matches IR's Use class.
2024-12-19 09:07:42 -08:00
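A hedged sketch of the loop shape this enables (LLVM C++ fragment; assumes an SDNode *N and the post-change SDUse API):

```
// With getOperandNo() on SDUse, a use walk no longer needs an explicit
// iterator just to recover the operand index.
for (SDUse &Use : N->uses()) {
  SDNode *User = Use.getUser();       // the node that uses N
  unsigned OpNo = Use.getOperandNo(); // which operand of User refers to N
  (void)User;
  (void)OpNo;
}
```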
Craig Topper
e6b2495545
[SelectionDAG] Split SDNode::use_iterator into user_iterator and use_iterator. (#120531)
SDNode::use_iterator now returns an SDUse& when dereferenced.
SDNode::user_iterator returns SDNode*. SDNode::use_begin/use_end/uses
work on use_iterator. SDNode::user_begin/user_end/users work on
user_iterator.

We can now write range based for loops using SDUse& and SDNode::uses().
I've converted many of these in this patch. I didn't update loops that
have additional variables updated in their for statement.

Some loops use SDNode::use_iterator::getOperandNo() which also prevents
using range based for loops. I plan to move this into SDUse in a follow
up patch.
2024-12-19 08:35:32 -08:00
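In short, the two loop styles after this change (fragment, assuming an SDNode *N):

```
// users(): dereferences to SDNode*, for when only the using node matters.
for (SDNode *User : N->users())
  (void)User;

// uses(): dereferences to SDUse&, for when the specific edge matters.
for (SDUse &Use : N->uses())
  (void)Use;
```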
Craig Topper
bd261ecc5a
[SelectionDAG] Add SDNode::user_begin() and use it in some places (#120509)
Most of these are just places that want the first user and aren't
iterating over the whole list.

While there I changed some use_size() == 1 to hasOneUse() which
is more efficient.

This is part of an effort to rename use_iterator to user_iterator
and provide a use_iterator that dereferences to SDUse&. This patch
helps reduce the diff on later patches.
2024-12-18 22:13:04 -08:00
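Two of the patterns this commit touches, sketched (fragment, assuming an SDNode *N with at least one user):

```
// Grab just the first user instead of iterating the whole list.
SDNode *FirstUser = *N->user_begin();
(void)FirstUser;

// hasOneUse() answers "exactly one use?" without walking the full use list,
// unlike use_size() == 1.
if (N->hasOneUse()) {
  // ...
}
```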
Craig Topper
104ad9258a
[SelectionDAG] Rename SDNode::uses() to users(). (#120499)
This function is most often used in range based loops or algorithms
where the iterator is implicitly dereferenced. The dereference returns
an SDNode * of the user rather than SDUse * so users() is a better name.

I've long been annoyed that we can't write a range based loop over
SDUse when we need getOperandNo. I plan to rename use_iterator to
user_iterator and add a use_iterator that returns SDUse& on dereference.
This will make it more like IR.
2024-12-18 20:09:33 -08:00
Aaditya
0ae75eba67
[AMDGPU] Assert if stack grows downwards. (#119888) 2024-12-14 17:44:40 +05:30
Piotr Sobczak
a2d086af2c
[AMDGPU] Fix FMA combine (#119217)
Update the check in the FMA combine to check dot10-insts instead of
dot7-insts.

The target of the combine, v_dot2_f32_f16, is available only if
dot10-insts target feature is enabled.

The issue probably dates back to the change that split dot10-insts
out of dot7-insts.

As far as I can see, this does not affect any current targets, but if a
future target had dot7-insts without dot10-insts, that would cause a
crash ("cannot select") for the input IR in the test.
2024-12-10 10:11:19 +01:00
Vikash Gupta
0b0d9a3bee
[CodeGen] [AMDGPU] Attempt DAGCombine for fmul with select to ldexp (#111109)
Materializing a 32-bit non-inline constant for an fmul is relatively
expensive, so where possible it is better to combine the fmul into an
ldexp instruction for specific scenarios (for data types like f64, f32
and f16), as handled here:

The DAG combine applies to any pair of select values that are exact
powers of 2.

```
fmul x, select(y, A, B)       -> ldexp (x, select i32 (y, a, b))
fmul x, select(y, -A, -B)    -> ldexp ((fneg x), select i32 (y, a, b))

where, A=2^a & B=2^b ; a and b are integers.  
```

This DAG combine is handled separately in fmulCombine (newly defined in
SIISelLowering), targeting fmul and fusing it with its select-type
operand into ldexp.

Fixes #104900.
2024-12-09 12:52:04 +05:30
Austin Kerbow
b1d42465fc
[AMDGPU] Fix hidden kernarg preload count inconsistency (#116759)
It is possible that the number of hidden arguments selected to
be preloaded in AMDGPULowerKernelArguments and in isel can differ. This
isn't an issue with explicit arguments since isel can lower the argument
correctly either way, but with hidden arguments we may have alignment
issues if we try to load these hidden arguments that were added to the
kernel signature.

The reason for the mismatch is that isel reserves an extra synthetic
user SGPR for module LDS.

Instead of teaching lowerFormalArguments how to handle these properly it
makes more sense and is less expensive to fix the mismatch and assert if
we ever run into this issue again. We should never be trying to lower
these in the normal way.

In a future change we probably want to revise how we track "synthetic"
user SGPRs and unify the handling in GCNUserSGPRUsageInfo. Sometimes
synthetic SGPRs are considered user SGPRs and sometimes they are not.
Until then this patch resolves the inconsistency, fixes the bug, and is
otherwise an NFC.
2024-12-08 10:10:08 -08:00
Jakub Chlanda
edbebda454
[AMDGPU] Assert previous SGPR exists when bundling preloaded args (#118802)
This came up from a downstream static analysis tool.
2024-12-06 09:15:22 +01:00
Matt Arsenault
15676ec552
AMDGPU: Add support for V_CVT_PK_F16_F32 instruction for gfx950 (#118300)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-12-02 16:04:24 -05:00
Matt Arsenault
7221bc74bc
AMDGPU: Make v2f16 minimum/maximum legal for gfx950 (#117738) 2024-11-26 14:51:05 -05:00
Matt Arsenault
f5e92eb04b
AMDGPU: Handle f32 minimum3/maximum3 pattern for gfx950 (#117737) 2024-11-26 14:47:52 -05:00
Matt Arsenault
e57b327be2
AMDGPU: Legalize fminimum and fmaximum f32 for gfx950 (#117634)
Select to minimum3/maximum3. Leave f16/v2f16 for later
since it's complicated by only having the vector version.
2024-11-26 14:44:09 -05:00
Piotr Sobczak
a96ec01e1a
[AMDGPU] Optimize out s_barrier_signal/_wait (#116993)
Extend the optimization that converts s_barrier to wave_barrier (nop)
when the number of work items is not larger than wave size.
    
This handles the "split barrier" form of s_barrier where the barrier
is represented by separate intrinsics (s_barrier_signal/s_barrier_wait).
Note: the version where s_barrier is used in gfx12 (and split later)
already has the optimization, but some front-ends may prefer to use the
split intrinsics directly, and that is the case this patch addresses.
2024-11-26 10:04:32 +01:00
Craig Topper
bc282605df
[SelectionDAG] Require last operand of (STRICT_)FP_ROUND to be a TargetConstant. (#117639)
Fix all the places I could find that didn't do this. We were already
mostly correct for FP_ROUND after
9a976f36615dbe15e76c12b22f711b2e597a8e51, but not STRICT_FP_ROUND.
2024-11-25 21:36:33 -08:00
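A hedged sketch of building a conforming FP_ROUND (LLVM C++ fragment; assumes SelectionDAG &DAG, an SDLoc DL, and an f64 SDValue Src):

```
// The trailing "trunc" flag must be a TargetConstant, e.g. via the isTarget
// form of getIntPtrConstant rather than a plain constant node.
SDValue Rounded = DAG.getNode(ISD::FP_ROUND, DL, MVT::f32, Src,
                              DAG.getIntPtrConstant(0, DL, /*isTarget=*/true));
```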
Matt Arsenault
e97fb2207e
AMDGPU: Add support for load transpose instructions for gfx950 (#117378)
This patch adds support for intrinsics in clang, as well as assembly
instructions in the backend.

Co-authored-by: Sirish Pande <Sirish.Pande@amd.com>
2024-11-25 09:39:04 -08:00
Nikita Popov
3317c9ceac
[AMDGPU] Use getSignedConstant() where necessary (#117328)
Create signed constant using getSignedConstant(), to avoid future
assertion failures when we disable implicit truncation in getConstant().

This also touches some generic legalization code, which apparently only
AMDGPU tests.
2024-11-25 09:49:34 +01:00
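A hedged sketch of the substitution (LLVM C++ fragment; assumes SelectionDAG &DAG, an SDLoc DL, and an EVT VT):

```
// For negative values, getConstant() relies on implicit truncation/extension
// of its uint64_t argument; getSignedConstant() takes an int64_t and stays
// valid once implicit truncation is disallowed.
SDValue AllOnes = DAG.getSignedConstant(-1, DL, VT);
```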
Matt Arsenault
1944d192bd
AMDGPU: Use isWave[32|64] instead of comparing size value (#117411) 2024-11-23 09:30:57 -08:00
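A sketch of the substitution (fragment; assumes a GCNSubtarget &ST):

```
// Prefer the predicate over comparing the raw wavefront size.
bool W32 = ST.isWave32();                  // preferred
bool W32Raw = ST.getWavefrontSize() == 32; // pattern being replaced
(void)W32;
(void)W32Raw;
```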
Matt Arsenault
bd8a953e9b
AMDGPU: Fix mfma scale source legalization (#117238)
The code inside the assert assigned to the variable instead of comparing against it.

Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2024-11-21 15:30:01 -08:00
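The bug pattern described above in a minimal standalone form (not the actual mfma code; the names are illustrative):

```
#include <cassert>

// Classic pitfall: '=' inside an assert assigns and then tests the new value,
// silently changing state; '==' is the intended comparison.
void checkOpcode(int &Opcode, int Expected) {
  // assert(Opcode = Expected); // bug: overwrites Opcode, passes whenever nonzero
  assert(Opcode == Expected);   // fix: compare, don't assign
  (void)Opcode;
  (void)Expected;
}
```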
Matt Arsenault
01c9a14ccf
AMDGPU: Define v_mfma_f32_{16x16x128|32x32x64}_f8f6f4 instructions (#116723)
These use a new VOP3PX encoding for the v_mfma_scale_* instructions,
which bundles the pre-scale v_mfma_ld_scale_b32. None of the modifiers
are supported yet (op_sel, neg or clamp).

I'm not sure the intrinsic should really expose op_sel (or any of the
others). If I'm reading the documentation correctly, we should be able
to just have the raw scale operands and auto-match op_sel to byte
extract patterns.

The op_sel syntax also seems extra horrible in this usage, especially with the
usual assumed op_sel_hi=-1 behavior.
2024-11-21 08:51:58 -08:00
Jay Foad
ade0750e35
[AMDGPU] Fix some cache policy checks for GFX12+ (#116396)
Fix coding errors found by inspection and check that the swz bit still
serves to prevent merging of buffer loads/stores on GFX12+.
2024-11-21 08:22:59 +00:00
Matt Arsenault
927032807d
AMDGPU: Handle gfx950 96/128-bit buffer_load_lds (#116681)
Enforcing this limit in the clang builtin will come later.
2024-11-18 22:01:56 -08:00
Matt Arsenault
50224bd5ba
AMDGPU: Handle gfx950 global_load_lds_* instructions (#116680)
Define global_load_lds_dwordx3 and global_load_lds_dwordx4.
Oddly it seems dwordx2 was skipped.
2024-11-18 21:58:02 -08:00
Matt Arsenault
738bdd4969
AMDGPU: Add V_CVT_PK_BF16_F32 for gfx950 (#116678) 2024-11-18 21:50:54 -08:00
Sergei Barannikov
baf59be89b
[SelectionDAG] Fix return types of TC_RETURN for several targets (#116504)
TC_RETURN nodes do not have a glue result.
2024-11-17 02:14:05 +03:00
Kazu Hirata
be187369a0
[AMDGPU] Remove unused includes (NFC) (#116154)
Identified with misc-include-cleaner.
2024-11-13 21:10:03 -08:00
Kazu Hirata
4048c64306
[llvm] Remove redundant control flow statements (NFC) (#115831)
Identified with readability-redundant-control-flow.
2024-11-12 10:09:42 -08:00
Shilei Tian
6548b6354d Reapply "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)"
This reverts commit ca33649abe5fad93c57afef54e43ed9b3249cd86.
2024-11-08 20:21:16 -05:00
Shilei Tian
ca33649abe Revert "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)"
This reverts commit e215a1e27d84adad2635a52393621eb4fa439dc9 as it broke both
hip and openmp buildbots.
2024-11-08 16:36:35 -05:00
Shilei Tian
e215a1e27d
[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403) 2024-11-08 13:05:35 -05:00
Jeffrey Byrnes
ae6dbed594
[AMDGPU] Use correct DWord for v_dot4 S0 operand (#115224)
Fixes a copy-paste typo.

The typo resulted in producing bad v_perm based operands for the v_dot4
combine. When adding a corresponding byte pair to the v_dot byte pair
chains, we must take note of the byte position in the corresponding
source nodes. These byte positions are used to ensure we extract the
correct DWord from the ultimate source, and formulate a correct
perm_mask from the extracted DWord.

With the typo, the S0 byte would use the DWord offset of the
corresponding S1 byte. If this offset was not the same as the true DWord
offset for the S0 byte, we would extract and use the wrong byte for S0
in the v_dot.

Fixes https://github.com/llvm/llvm-project/issues/112941
2024-11-06 20:48:20 -08:00
Gang Chen
8c752900dd
[AMDGPU] modify named barrier builtins and intrinsics (#114550)
Use a local pointer type to represent the named barrier in builtin and
intrinsic. This makes the definitions more user friendly
because users do not need to worry about the hardware ID assignment. This
approach is also more like other popular GPU programming languages.
Named barriers should be represented as global variables of addrspace(3)
in LLVM IR. The compiler assigns the special LDS offsets for those variables
during the AMDGPULowerModuleLDS pass. Those addresses are converted to hardware
barrier IDs during instruction selection. The rest of the
instruction-selection changes are primarily due to the
intrinsic-definition changes.
2024-11-06 10:37:22 -08:00
Stanislav Mekhanoshin
6d7e51de5e
[AMDGPU] Extend type support for update_dpp intrinsic (#114597)
We can split 64-bit DPP as a post-RA pseudo if control values are
supported, but cannot handle other types.
2024-11-05 13:59:14 -08:00
Matt Arsenault
30dd1297fa
AMDGPU: Custom expand flat cmpxchg which may access private (#109410)
64-bit flat cmpxchg instructions do not work correctly for scratch
addresses, and need to be expanded as non-atomic.

Allow custom expansion of cmpxchg in AtomicExpand, as is
already the case for atomicrmw.
2024-11-04 09:29:38 -08:00
Shilei Tian
11df0ce140 [NFC][AMDGPU] Use structured binding to replace explicit use of std::pair 2024-11-02 15:11:55 -04:00
Matt Arsenault
1d0370872f
AMDGPU: Expand flat atomics that may access private memory (#109407)
If the runtime flat address resolves to a scratch address,
64-bit atomics do not work correctly. Insert a runtime address
space check (which is quite likely to be uniform) and select between
the non-atomic and real atomic cases.

Consider noalias.addrspace metadata and avoid this expansion when
possible (we also need to consider it to avoid infinitely expanding
after adding the predication code).
2024-10-31 08:08:48 -07:00
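A conceptual model of the predication this commit describes (plain C++, not the actual AtomicExpand output; PtrIsPrivate stands in for the runtime address-space check):

```
#include <atomic>
#include <cstdint>

// If the flat address resolves to scratch, do a plain read-modify-write
// (private memory is lane-local); otherwise take the real 64-bit atomic path.
uint64_t flatAtomicAddModel(uint64_t *Ptr, uint64_t Val, bool PtrIsPrivate) {
  if (PtrIsPrivate) {
    uint64_t Old = *Ptr;
    *Ptr = Old + Val;
    return Old;
  }
  std::atomic_ref<uint64_t> Ref(*Ptr); // requires C++20
  return Ref.fetch_add(Val);
}
```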
Stanislav Mekhanoshin
7cd29741fa
[AMDGPU] Extend mov_dpp8 intrinsic lowering for generic types (#114296)
The int_amdgcn_mov_dpp8 is overloaded, but we can only select i32.
To allow a corresponding builtin to be overloaded the same way as
int_amdgcn_mov_dpp we need it to be able to split unsupported values.
2024-10-31 01:15:25 -07:00