1495 Commits

Author SHA1 Message Date
Jay Foad
6bf4476ffb
[AMDGPU] Fix @llvm.amdgcn.cs.chain with callee not provably uniform (#114200)
The correct behavior is to insert a readfirstlane. This worked except
for an inappropriate assertion in SITargetLowering::LowerCall.
2024-10-30 16:18:29 +00:00
Jay Foad
8ee5e19c87
[AMDGPU] Fix @llvm.amdgcn.cs.chain with SGPR args not provably uniform (#114232)
The correct behaviour is to insert a readfirstlane. SelectionDAG was
already doing this in some cases, but not in the general case for chain
calls. GlobalISel was already doing this for return values but not for
arguments.
2024-10-30 16:12:37 +00:00
Shilei Tian
4cf128512b
[NFC][AMDGPU] Use C++17 structured bindings as much as possible (#113939)
This only changes `llvm/lib/Target/AMDGPU/SIISelLowering.cpp`.
There are five uses of `std::tie` remaining because they can't be
replaced with C++17 structured bindings.
2024-10-28 13:55:57 -04:00
Shilei Tian
c3fe0e46e2
[NFC][AMDGPU] clang-format llvm/lib/Target/AMDGPU/SIISelLowering.cpp (#112645) 2024-10-21 16:42:25 -07:00
Rahul Joshi
6924fc0326
[LLVM] Add Intrinsic::getDeclarationIfExists (#112428)
Add `Intrinsic::getDeclarationIfExists` to lookup an existing
declaration of an intrinsic in a `Module`.
2024-10-16 07:21:10 -07:00
Fabian Ritter
173c68239d
[AMDGPU] Enable unaligned scratch accesses (#110219)
This allows us to emit wide generic and scratch memory accesses when we
do not have alignment information. In cases where accesses happen to be
properly aligned or where generic accesses do not go to scratch memory,
this improves performance of the generated code by a factor of up to 16x
and reduces code size, especially when lowering memcpy and memmove
intrinsics.

Also: Make the use of the FeatureUnalignedScratchAccess feature more
consistent: FeatureUnalignedScratchAccess and EnableFlatScratch are now
orthogonal, whereas before, some code assumed that the latter implies
the former.

Part of SWDEV-455845.
2024-10-11 08:50:49 +02:00
Jay Foad
62b3a4bc70
[AMDGPU] Improve codegen for s_barrier_init (#111866) 2024-10-10 19:40:02 +01:00
Matt Arsenault
c198f775cd
AMDGPU: Remove flat/global fmin/fmax intrinsics (#105642)
These have been replaced with atomicrmw.
2024-10-09 09:27:28 +04:00
Shilei Tian
88a239d292
[AMDGPU] Adopt new lowering sequence for fdiv16 (#109295)
The current lowering of `fdiv16` can generate incorrectly rounded results
in some cases. The new sequence, provided by the HW team, is shown below
written in C++.


```
half fdiv(half a, half b) {
  float a32 = float(a);
  float b32 = float(b);
  float r32 = 1.0f / b32;
  float q32 = a32 * r32;
  float e32 = -b32 * q32 + a32;
  q32 = e32 * r32 + q32;
  e32 = -b32 * q32 + a32;
  float tmp = e32 * r32;
  uint32_t tmp32 = std::bit_cast<uint32_t>(tmp);
  tmp32 = tmp32 & 0xff800000;
  tmp = std::bit_cast<float>(tmp32);
  q32 = tmp + q32;
  half q16 = half(q32);
  q16 = div_fixup_f16(q16);
  return q16;
}
```

Fixes SWDEV-477608.
2024-10-08 09:49:20 -04:00
Austin Kerbow
c4d89203f3
[AMDGPU] Support preloading hidden kernel arguments (#98861)
Adds hidden kernel arguments to the function signature and marks them
inreg if they should be preloaded into user SGPRs. The normal kernarg
preloading logic then takes over with some additional checks for the
correct implicitarg_ptr alignment.

Special care is needed so that metadata for the hidden arguments is not
added twice when generating the code object.
2024-10-06 17:44:33 -07:00
Matt Arsenault
428ae0f12e
AMDGPU: Do not tail call if an inreg argument requires waterfalling (#111002)
If we have a divergent value passed to an outgoing inreg argument,
the call needs to be executed in a waterfall loop and thus cannot
be tail called.

The waterfall handling of arbitrary calls is broken on the selectiondag
path, so some of these cases still hit an error later.

I also noticed the argument evaluation code in isEligibleForTailCallOptimization
is not correctly accounting for implicit argument assignments. It also seems
inreg codegen is generally broken; we are assigning arguments to the reserved
private resource descriptor.
2024-10-04 00:04:02 +04:00
Matt Arsenault
c08d7b3de7
AMDGPU: Fix verifier error on tail call target in vgprs (#110984)
We allow tail calls of known uniform function pointers. This
would produce a verifier error if the uniform value is in VGPRs.
Insert readfirstlanes just in case this occurs, which will fold
out later if it is unnecessary.

GlobalISel should need a similar fix, but it currently does not
attempt tail calls of indirect calls.

Fixes #107447
Fixes subissue of #110930
2024-10-03 21:50:56 +04:00
Jay Foad
8d13e7b8c3
[AMDGPU] Qualify auto. NFC. (#110878)
Generated automatically with:
$ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find lib/Target/AMDGPU/ -type f)
2024-10-03 13:07:54 +01:00
Fabian Ritter
3ba4092c06
[AMDGPU] Check vector sizes for physical register constraints in inline asm (#109955)
For register constraints that require specific register ranges, the
width of the range should match the type of the associated
parameter/return value. With this PR, we error out when that is not the
case. Previously, these cases would hit assertions or llvm_unreachables.

The handling of register constraints that require only a single register
remains more lenient to allow narrower non-vector types for the
associated IR values. For example, constraining an i16 or i8 value to a
32-bit register is still allowed.

Fixes #101190.

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2024-10-01 10:29:35 +02:00
Matt Arsenault
5883ad34d6
DAG: Handle vector legalization of minimumnum/maximumnum (#109779)
Follow the same patterns as the other min/max variants.
2024-09-30 13:43:35 +04:00
sstipano
eb16acedf5
[AMDGPU] Overload resource descriptor in image intrinsics. (#107255) 2024-09-27 15:33:52 +02:00
gonzalobg
0f521931b8
LLVMContext: add getSyncScopeName() to lookup individual scope name (#109484)
This PR adds a `getSyncScopeName(Id)` API to `LLVMContext` that
returns the `StringRef` for that ID, if any.
2024-09-25 11:13:56 -07:00
Pierre van Houtryve
de70b959b1
[AMDGPU] Fix typo in promoteUniformOpToI32 (#109942) 2024-09-25 12:42:57 +02:00
Jay Foad
d075debc50
[AMDGPU] Fix chain handling when lowering barrier intrinsics (#109799)
Previously we would fail an assertion in RemoveNodeFromCSEMaps after
lowering:
  t3: ch = llvm.amdgcn.s.barrier.join t0, TargetConstant:i64<2973>, Constant:i32<0>
to:
  t6: ch = S_BARRIER_JOIN_IMM TargetConstant:i32<0>
2024-09-24 16:50:46 +01:00
Pierre van Houtryve
758444ca3e
[AMDGPU] Promote uniform ops to I32 in DAGISel (#106383)
Promote uniform binops, selects and setcc between 2 and 16 bits to 32
bits in DAGISel

Solves #64591
2024-09-19 09:00:21 +02:00
Jay Foad
2c85fe9689 [AMDGPU] Remove miscellaneous unused code. NFC. 2024-09-18 16:45:08 +01:00
Piotr Sobczak
adf02ae41f
[AMDGPU] Simplify lowerBUILD_VECTOR (#109094)
Simplify `lowerBUILD_VECTOR` by commoning up the way the vectors
are split.
Also reorder the checks to avoid a long condition inside `if`.
2024-09-18 12:58:16 +02:00
Jay Foad
c657a6f6aa
[AMDGPU] Fix selection of s_load_b96 on GFX11 (#108029)
Fix a bug which resulted in selection of s_load_b96 on GFX11, which only
exists in GFX12.

The root cause was a mismatch between legalization and selection. The
condition used to check that the load was uniform in legalization
(SITargetLowering::LowerLOAD) was "!Op->isDivergent()". The condition
used to detect a non-uniform load during selection
(AMDGPUDAGToDAGISel::isUniformLoad()) was
"N->isDivergent() && !AMDGPUInstrInfo::isUniformMMO(MMO)". This makes a
difference when IR uniformity analysis has more information than SDAG's
built in analysis. In the test case this is because IR UA reports that
everything is uniform if isSingleLaneExecution() returns true, e.g. if
the specified max flat workgroup size is 1, but SDAG does not have this
optimization.

The immediate fix is to use the same condition to detect uniform loads
in legalization and selection. In future SDAG should learn about
isSingleLaneExecution(), and then it could probably stop relying on IR
metadata to detect uniform loads.
2024-09-12 13:41:40 +01:00
Jay Foad
e55d6f5ea2
[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive (#107889)
Always generate v_cndmask_b32 instead of modifying exec around
v_mov_b32. This is expected to be faster because
modifying exec generally causes pipeline stalls.
2024-09-11 17:16:06 +01:00
Nicolas Miller
ccc52a817f
[AMDGPU] Remove dead code in SIISelLowering (NFC) (#108198)
This return is dead code as the return just above will always be taken.
2024-09-11 15:52:10 +01:00
Jay Foad
7a30b9c0f0
[AMDGPU] Make more use of getWaveMaskRegClass. NFC. (#108186) 2024-09-11 14:55:53 +01:00
Jay Foad
306b08c3a9 [AMDGPU] Remove unused SITargetLowering::isMemOpUniform 2024-09-10 13:11:55 +01:00
Stanislav Mekhanoshin
0745219d4a
[AMDGPU] Add target intrinsic for s_buffer_prefetch_data (#107293) 2024-09-06 11:41:21 -07:00
Stanislav Mekhanoshin
bd840a4004
[AMDGPU] Add target intrinsic for s_prefetch_data (#107133) 2024-09-05 15:14:31 -07:00
Changpeng Fang
26b0bef192
AMDGPU: Use pattern to select instruction for intrinsic llvm.fptrunc.round (#105761)
Use GCNPat instead of Custom Lowering to select instructions for
intrinsic llvm.fptrunc.round. "SupportedRoundMode : TImmLeaf" is used as
a predicate to select only when the rounding mode is supported.
"as_hw_round_mode : SDNodeXForm" is developed to translate the round
modes to the corresponding ones that hardware recognizes.
2024-08-29 11:43:58 -07:00
Austin Kerbow
ceb587a16c
[AMDGPU] Fix crash in allowsMisalignedMemoryAccesses with i1 (#105794) 2024-08-23 11:51:37 -07:00
Jay Foad
b02b5b7b59
[AMDGPU] Simplify use of hasMovrel and hasVGPRIndexMode (#105680)
The generic subtarget has neither of these features. Rather than forcing
HasMovrel on, it is simpler to expand dynamic vector indexing to a
sequence of compare/select instructions.

NFC for real subtargets.
2024-08-23 09:59:19 +01:00
Matt Arsenault
ee08d9cba5
AMDGPU: Remove global/flat atomic fadd intrinsics (#97051)
These have been replaced with atomicrmw.
2024-08-22 23:27:33 +04:00
Matt Arsenault
9d364286f3
AMDGPU: Remove flat/global atomic fadd v2bf16 intrinsics (#97050)
These are now fully covered by atomicrmw.
2024-08-21 14:26:42 +04:00
Jay Foad
2258bc429b
[AMDGPU] Simplify, fix and improve known bits for mbcnt (#104768)
Simplify by using KnownBits::add.

Fix GlobalISel path which was ignoring the known bits of src1.

Improve analysis of mbcnt.hi which adds at most 31 even in wave64.
2024-08-19 18:22:06 +01:00
Changpeng Fang
16929219b0
AMDGPU: Add tonearest and towardzero roundings for intrinsic llvm.fptrunc.round (#104486)
This work simplifies and generalizes the instruction definition for
intrinsic llvm.fptrunc.round. We no longer name the instruction with the
rounding mode. Instead, we introduce an immediate operand for the
rounding mode for the pseudo instruction. This immediate will be used to
set up the hardware mode register at the time the real instruction is
generated. We name the pseudo instruction as FPTRUNC_ROUND_F16_F32 (for
f32 -> f16), which is easy to generalize for other types.

"round.towardzero" and "round.tonearest" are added for f32 -> f16
truncating, in addition to the existing "round.upward" and
"round.downward". Other rounding modes are not supported by hardware at
this moment.
2024-08-17 11:22:47 -07:00
Matt Arsenault
ef56061dcf AMDGPU: Rename type helper functions in atomic handling
Requested on #95394
2024-08-16 22:54:55 +04:00
Matt Arsenault
0f08aa43a6
AMDGPU: Avoid manually reconstructing atomicrmw (#103769)
When introducing the address space predicates, move and mutate
the original instruction, and clone for the shared case.
2024-08-14 22:01:15 +04:00
Matt Arsenault
0edd07770f
AMDGPU: Preserve alignment when custom expanding atomicrmw (#103768) 2024-08-14 17:16:59 +04:00
Craig Topper
51bad732dc [SelectionDAG] Replace EVTToAPFloatSemantics with MVT/EVT::getFltSemantics. (#103001) 2024-08-13 11:35:28 -07:00
Matt Arsenault
edded8d7b5
AMDGPU: Stop handling legacy amdgpu-unsafe-fp-atomics attribute (#101699)
This is now autoupgraded to annotate atomicrmw instructions in
old bitcode.
2024-08-13 22:02:25 +04:00
Matt Arsenault
1ae507d109
AMDGPU: Do not create phi user for atomicrmw with no uses (#103061) 2024-08-13 19:24:52 +04:00
Kazu Hirata
f4fb735840
[llvm] Construct SmallVector<SDValue> with ArrayRef (NFC) (#102578) 2024-08-09 09:15:42 -07:00
Matt Arsenault
42b5540211
AMDGPU: Preserve atomicrmw name when specializing address space (#102470) 2024-08-09 00:43:04 +04:00
Matt Arsenault
bb7143f666
AMDGPU: Avoid creating unnecessary block split in atomic expansion (#102440)
This was creating a new block to insert the is.shared check, but we
can just do that in the original block.
2024-08-09 00:39:12 +04:00
Matt Arsenault
88a85942ce
AMDGPU: Directly handle all atomicrmw cases in SIISelLowering (#102439) 2024-08-08 22:45:43 +04:00
Matt Arsenault
c66777ee1b AMDGPU: Generalize atomicrmw handling in custom expansion
Use the utility function instead of assuming fadd. No change
as-is, but will soon be used for other expansions.
2024-08-08 12:30:24 +04:00
Matt Arsenault
dfda9c5b9e
AMDGPU: Handle new atomicrmw metadata for fadd case (#96760)
This is the most complex atomicrmw support case. Note we don't have
accurate remarks for all of the cases, which I'm planning on fixing
in a later change with more precise wording.

Continue respecting amdgpu-unsafe-fp-atomics until its eventual removal.
Also seems to fix a few cases that were not interpreting
amdgpu-unsafe-fp-atomics aggressively enough.
2024-08-02 19:41:33 +04:00
Matt Arsenault
41439d5bb7
AMDGPU: Handle remote/fine-grained memory in atomicrmw fmin/fmax lowering (#96759)
Consider the new atomic metadata when choosing to expand as cmpxchg
instead.
2024-08-01 22:08:01 +04:00
Matt Arsenault
1d2b2d29d7
AMDGPU: Cleanup extract_subvector actions (NFC) (#101454)
The base AMDGPUISelLowering was setting custom action on 16-bit
vector types, but also set in SIISelLowering.
2024-08-01 10:55:28 +04:00