7296 Commits

Author SHA1 Message Date
Jun Wang
86842e1f72
[AMDGPU] New clang option for emitting a waitcnt instruction after each memory instruction (#79236)
This patch introduces a new command-line option for clang, namely,
amdgpu-precise-mem-op (or precise-memory in the backend). When this option is specified, a waitcnt
instruction is generated after each memory load/store instruction. The
counter values are always 0, but which counters are involved depends on
the memory instruction.

---------

Co-authored-by: Jun Wang <jun.wang7@amd.com>
2024-04-10 10:47:04 -07:00
Jay Foad
9c58f3a234
[AMDGPU] Fix implicit $vcc operands after parsing MIR (#87781)
MIParser checks that implicit operands match the instruction definition,
so they have to be $vcc even in wave32 mode. Use the mirFileLoaded hook
to fix them after MIParser's checks, converting them to $vcc_lo which is
what the rest of CodeGen expects.

This is all just extending the fixImplicitOperands hack which was
introduced with GFX10, but at least it makes it possible to write a MIR
test which creates the same instructions that normal CodeGen would
generate.
2024-04-09 09:10:45 +01:00
Matt Arsenault
acb2a47576 AMDGPU: Regenerate test checks 2024-04-08 08:17:09 -04:00
David Green
ac321cbb03
[AArch64][GlobalISel] Legalize Insert vector element (#81453)
This attempts to standardize and extend some of the insert vector
element lowering. Most notably:
- More types are handled by splitting illegal vectors.
- The index type for G_INSERT_VECTOR_ELT is canonicalized to
  TLI.getVectorIdxTy(), similar to extract_vector_element.
- Some of the existing patterns now have the index type specified to
  make sure they can apply to GISel too.
- The C++ selection code has been removed, relying on tablegen patterns.
- G_INSERT_VECTOR_ELT with small GPR input elements is pre-selected to
  use an i32 type, allowing the existing patterns to apply.
- Variable index inserts are lowered in post-legalizer lowering,
  expanding into a stack store and reload.
2024-04-08 08:44:13 +01:00
Bevin Hansson
110c22fe12
[ExpandLargeFpConvert] Support bfloat. (#87619)
The conversion expansions did not properly handle bfloat types.

I'm not certain that these expansions are completely correct;
I don't have any experience with AMDGPU or the ability to run
anything to test it.

Note that it doesn't seem like AMDGPU with GlobalISel can
handle fptrunc of float to bfloat, which is needed for itofp.
I've omitted the GISEL run for the bfloat case.

This fixes #85379.
2024-04-08 09:07:55 +02:00
Piotr Sobczak
5b59ae423a
[DAG] Preserve NUW when reassociating (#87621)
Similarly to the generic case below, preserve the NUW flag when
reassociating adds with constants.
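
A minimal sketch of why keeping the flag is sound, assuming the transform in question is (x + c1) + c2 -> x + (c1 + c2) with NUW set on both original adds (the commit message does not spell out the exact conditions). The helper names and the 5-bit width are illustrative only; the width is kept small so the exhaustive check is fast.

```python
BITS = 5                 # illustrative width, small enough to enumerate
LIMIT = 1 << BITS


def add_nuw(a, b):
    """Return (wrapped sum, True if the unsigned add did not wrap)."""
    total = a + b
    return total % LIMIT, total < LIMIT


def reassociation_keeps_nuw():
    """Check every case where both original adds are no-unsigned-wrap."""
    for x in range(LIMIT):
        for c1 in range(LIMIT):
            inner, inner_ok = add_nuw(x, c1)
            if not inner_ok:
                continue
            for c2 in range(LIMIT):
                outer, outer_ok = add_nuw(inner, c2)
                if not outer_ok:
                    continue
                # Reassociated form: fold the constants first.
                folded, folded_ok = add_nuw(c1, c2)
                result, result_ok = add_nuw(x, folded)
                if not (folded_ok and result_ok and result == outer):
                    return False
    return True
```

If neither original add wraps, the true sum x + c1 + c2 fits in the type, so both the folded constant and the final add fit as well.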
2024-04-04 16:47:25 +02:00
Jay Foad
3cf539fb04
[AMDGPU] Combine or remove redundant waitcnts at the end of each MBB (#87539)
Call generateWaitcnt unconditionally at the end of
SIInsertWaitcnts::insertWaitcntInBlock. Even if we don't need to
generate a new waitcnt instruction it has the effect of combining or
removing redundant waitcnts that were already present. Tests show
various small improvements in waitcnt placement.
2024-04-04 10:14:16 +01:00
Bevin Hansson
cd6434f9ec
[ExpandLargeDivRem] Scalarize vector types. (#86959)
expand-large-divrem cannot handle vector types.
If overly large vector element types survive into
isel, they will likely be scalarized there, but since
isel cannot handle scalar integer types of that size,
it will assert.

Handle vector types in expand-large-divrem by
scalarizing them and then expanding the scalar type
operation. For large vectors, this results in a
*massive* code expansion, but it's better than
asserting.
2024-04-02 16:37:36 +02:00
Sameer Sahasrabuddhe
421557974a
[AMDGPU] Use glue for convergence tokens at call-like operations (#86766)
The earlier implementation on AMDGPU used explicit token operands at
SI_CALL and SI_CALL_ISEL. This is now replaced with CONVERGENCECTRL_GLUE
operands, with the following effects:

- The treatment of tokens at call-like operations is now consistent with
the treatment at intrinsics.
- Support for tail calls using implicit tokens at SI_TCRETURN "just
works".
- The extra parameter at call-like instructions is eliminated, thus
restoring those instructions and their handling to the original state.

The new glue node is placed after the existing glue node for the
outgoing call parameters, which seems to not interfere with selection of
the call-like nodes.
2024-04-01 10:51:13 +05:30
Vitaly Buka
20f56e1f8e
[CodeGen] Add default lowering for llvm.allow.{runtime,ubsan}.check() (#86049)
RFC:
https://discourse.llvm.org/t/rfc-add-llvm-experimental-hot-intrinsic-or-llvm-hot/77641
2024-03-31 22:19:33 -07:00
Ruiling, Song
216b5e9666
[AMDGPU] Expose RTZ version of f16 interpolation for gfx11+ (#86614) 2024-04-01 09:48:37 +08:00
Austin Kerbow
b5b34dbb27
[AMDGPU] Use directive for kernarg preload header padding (#86004) 2024-03-31 11:03:03 -07:00
Austin Kerbow
0234d90d81
[AMDGPU] Extend MFMA padding option to gfx90a+ (#86768)
It was shown experimentally that this may have some benefit on newer HW.
2024-03-31 10:46:05 -07:00
Jay Foad
95258419f6
[AMDGPU] Use AMDGPU::isIntrinsicAlwaysUniform in isSDNodeAlwaysUniform (#87085)
This is mostly just a simplification, but tests show a slight codegen
improvement in code using the deprecated amdgcn.icmp/fcmp intrinsics.
2024-03-30 08:01:18 +00:00
Shilei Tian
3a106e5b2c
[GlobalISel] Fold G_ICMP if possible (#86357)
This patch tries to fold `G_ICMP` if possible.
2024-03-29 15:59:50 -04:00
Shilei Tian
661bb9daae
[GlobalISel] Handle div-by-pow2 (#83155)
This patch adds similar handling of div-by-pow2 as in `SelectionDAG`.
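
The commit message does not show the emitted sequence. As a hedged illustration, the usual bias-and-shift expansion for a signed divide by 2**k looks like the sketch below. The 8-bit width and all helper names are assumptions for the example, not taken from the patch.

```python
BITS = 8                     # illustrative width
MASK = (1 << BITS) - 1


def to_signed(value):
    """Interpret the low BITS bits as a two's-complement integer."""
    value &= MASK
    return value - (1 << BITS) if value >> (BITS - 1) else value


def ashr(value, amount):
    """Arithmetic shift right on a BITS-wide bit pattern."""
    return (to_signed(value) >> amount) & MASK


def lshr(value, amount):
    """Logical shift right on a BITS-wide bit pattern."""
    return (value & MASK) >> amount


def sdiv_pow2(x, k):
    """Signed x / 2**k for 0 < k < BITS, rounding toward zero."""
    sign = ashr(x, BITS - 1)        # all ones if x is negative, else 0
    bias = lshr(sign, BITS - k)     # 2**k - 1 if x is negative, else 0
    return to_signed(ashr((x + bias) & MASK, k))


def reference_sdiv(x, divisor):
    """Round-toward-zero division, as sdiv is defined."""
    quotient = abs(x) // divisor
    return -quotient if x < 0 else quotient
```

The bias is what turns the floor rounding of an arithmetic shift into the round-toward-zero result that sdiv requires for negative dividends.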
2024-03-29 12:41:47 -04:00
Craig Topper
23d45e55ed
[MCP] Remove dead copies from basic blocks with successors. (#86973)
Previously we wouldn't remove dead copies from basic blocks with
successors. The comment said we didn't want to trust the live-in lists.
The comment is very old so I'm not sure if that's still a concern today.

This patch checks the live-in lists and removes copies from
MaybeDeadCopies if they are referenced by any live-ins in any
successors. We only do this if the tracksLiveness property is set. If
that property is not set, we retain the old behavior.
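
The check described above can be sketched roughly as follows for a block that has successors. The function name and data shapes are hypothetical, and register aliasing (sub- and super-registers) is deliberately ignored here, which the real pass would have to handle.

```python
def removable_dead_copies(maybe_dead_copies, successor_live_ins, tracks_liveness):
    """Return the copies that may be deleted at the end of a block.

    maybe_dead_copies: dict mapping destination register name -> copy label
    successor_live_ins: list of sets, one live-in set per successor block
    tracks_liveness: models the tracksLiveness function property
    """
    if not tracks_liveness:
        # Old behaviour: live-in lists are not trusted, so nothing is removed.
        return {}
    # Any register that is live into some successor keeps its copy.
    live = set().union(*successor_live_ins)
    return {reg: copy for reg, copy in maybe_dead_copies.items() if reg not in live}
```

For example, with copies into `r1` and `r2` and a single successor whose live-ins are `{"r2"}`, only the copy into `r1` is reported as removable.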
2024-03-28 14:43:49 -07:00
Shilei Tian
0a43ca731b
[AMDGPU] Fix missing IsExact flag when expanding vector binary operator (#86712) 2024-03-27 17:40:58 -04:00
Kevin P. Neal
f5296df97c
[FPEnv][AMDGPU] Correct AMDGPUSimplifyLibCalls handling of strictfp attribute. (#86705)
The AMDGPUSimplifyLibCalls pass was lowering function calls with the
strictfp attribute to sequences that included function calls incorrectly
lacking the attribute. This patch corrects that.

The pass now also emits the correct constrained fp call instead of
normal FP instructions when in a function with the strictfp attribute.
Replacing non-constrained calls with constrained calls when required
is still on the IRBuilder's TODO list.
2024-03-27 10:20:00 -04:00
Matt Arsenault
ef316da4a2 AMDGPU: Fix dead check prefixes in test 2024-03-27 14:42:47 +03:00
Thomas Symalla
256343a0e9
Revert "Update amdgpu_gfx functions to use s0-s3 for inreg SGPR arguments on targets using scratch instructions for stack #78226" (#86273)
Reverts llvm/llvm-project#81394

This reverts commit 3ac243bc0d7922d083af2cf025247b5698556062.
It is not handling RSrc registers s0-s3 correctly. This leads to a
broken test, where it expects s0-s3 as function argument and uses it as
RSrc register as well.
We need to re-visit the patch, but apparently we only want to have s0-s3 as
argument registers if we don't need them as RSrc registers.
2024-03-26 11:01:08 +01:00
David Green
4d315ff382
[GlobalISel] Add CTLZ known bits. (#86436)
Replicated from SDAG.
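
A rough model of the idea, assuming the rule is the usual one: bits known to be one in the input bound the largest possible leading-zero count, so every result bit above that bound's width is known zero. The function names are illustrative and this is not the LLVM code.

```python
def ctlz(value, bits):
    """Count leading zeros of value in a bits-wide integer."""
    return bits - value.bit_length()


def ctlz_known_zero_mask(known_one, bits):
    """Mask of result bits that must be zero, given input bits known to be one."""
    # The highest known-one bit caps how many leading zeros are possible.
    max_leading_zeros = bits - known_one.bit_length()
    # The result needs only this many low bits to hold that maximum.
    low_bits = max_leading_zeros.bit_length()
    return ((1 << bits) - 1) & ~((1 << low_bits) - 1)
```

With nothing known about an 8-bit input the result can be as large as 8, so only the top four result bits are known zero; if the top input bit is known to be one, the whole result is known zero.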
2024-03-26 09:11:35 +00:00
Bevin Hansson
14c30189fb
[ExpandLargeFpConvert] Fix incorrect values in fp-to-int conversion. (#86514)
The IR for a double-to-i129 conversion looks like this in one of the
blocks in compiler-rt:

  %cmp5.i = icmp ult i16 %3, -129, !dbg !24

But in ExpandLargeFpConvert, it looks like:

  %13 = icmp ult i129 %12, 4294967167, !dbg !19

ExpandLargeFpConvert is wrong; the value should have been
signed before negating, but instead we get a very large
unsigned value. Another value in the same pass also has this
issue.
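
The two constants can be reproduced with plain integer arithmetic. That the bad constant came from a 32-bit intermediate is an inference from 4294967167 = 2**32 - 129, not something the commit message states; the helper names are illustrative.

```python
def zext(value, from_bits):
    """Zero-extend: keep the low from_bits bits, upper bits become zero."""
    return value & ((1 << from_bits) - 1)


def sext(value, from_bits, to_bits):
    """Sign-extend the low from_bits bits to a to_bits-wide unsigned pattern."""
    low = value & ((1 << from_bits) - 1)
    if low >> (from_bits - 1):
        low -= 1 << from_bits
    return low & ((1 << to_bits) - 1)


# -129 negated as an unsigned 32-bit value, then widened with zeros.
wrong_constant = zext(-129, 32)
# -129 sign-extended all the way to i129, which is what the compare needs.
right_constant = sext(-129, 32, 129)
```

The zero-extended form is a large positive number far below 2**129, so an unsigned compare against it no longer means "less than -129".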
2024-03-26 10:08:22 +01:00
Changpeng Fang
350bda4419
AMDGPU: Rename intrinsics and remove f16/bf16 versions for load transpose (#86313)
Rename the intrinsics to close to the instruction mnemonic names:
Use global_load_tr_b64 and global_load_tr_b128 instead of
global_load_tr.

This patch also removes f16/bf16 versions of builtins/intrinsics. To
simplify the design, we should avoid enumerating all possible types in
implementing builtins. We can always use bitcast.
2024-03-25 16:55:22 -07:00
Jeffrey Byrnes
b761137049
[AMDGPU] Use correct VGPR threshold for flagging ExcessRP regions in unified register file case (#85860)
`ST.getMaxNumVGPRs(MF)` lowers to `AMDGPUBaseInfo.cpp:getTotalNumVGPRs`
which returns 512 for gfx90a. This is subsequently limited by
`AMDGPUBaseInfo:getAddressableNumVGPRs()`, which also returns 512 for
gfx90a. The ISA states we can have a total of 512 registers, but a
maximum of only 256 of each of AGPR and VGPR (gfx90a 3.6.4).

Therefore, in unified register file case, `ST.getMaxNumVGPRs(MF)`
calculates the maximum number of combined VGPR + AGPR. But, it is
currently used as the limit for accvgpr and as the limit for archvgpr.

This patch uses it as the combined limit, and accounts for the maximum addressable arch/acc VGPRs when calculating the per RegClass limits.
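
One way to picture the corrected check, using the gfx90a figures quoted above (512 combined, 256 per class) as defaults. The function name and parameters are hypothetical and do not mirror the scheduler code.

```python
def is_excess_rp(arch_vgprs, acc_vgprs, combined_limit=512, addressable_per_class=256):
    """Return True if a region's register pressure exceeds the unified-file limits."""
    # Each class is capped by what is addressable, never by the combined total alone.
    per_class_limit = min(combined_limit, addressable_per_class)
    if arch_vgprs > per_class_limit or acc_vgprs > per_class_limit:
        return True
    # The combined limit applies to the sum of both classes.
    return arch_vgprs + acc_vgprs > combined_limit
```

Under the old interpretation a region using 300 arch VGPRs and no AGPRs would have looked fine against 512; with a per-class cap it is flagged.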

It is not unreasonable to think other clients of getTotalNumVGPRs are
using it in the wrong way.
2024-03-25 13:11:58 -07:00
David Stuttard
06cfbe3cfd
[AMDGPU] Add support for idxen and bothen buffer load/store merging in SILoadStoreOptimizer (#86285)
Added more buffer instruction merging support
2024-03-25 14:44:22 +00:00
David Stuttard
75e528fdd9
[AMDGPU] Extend zero initialization of return values for TFE (#85759)
buffer_load instructions that use TFE also need to zero initialize
return values similar to how the image instructions currently work. Add
support for this with standard zero init of all results + zero init of
just TFE flag when enable-prt-strict-null subtarget feature is disabled.
2024-03-25 09:01:46 +00:00
Pierre van Houtryve
babbdad15b
[AMDGPU] Handle non-register operands for S_SUB/ADD_U64_PSEUDO (#86104)
This pseudo uses SSrc_b64 so it allows either an immediate or a register,
but the lowering crashed on immediate operands.
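
As an illustration only, a 64-bit add can be lowered as two 32-bit halves with a carry, and both operand kinds can be split the same way. The operand encoding (`("imm", value)` or `("reg", name)`) and the function names are invented for this sketch and are not the backend's representation.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def read_operand(operand, registers):
    """Resolve a hypothetical ("imm", value) or ("reg", name) operand to 64 bits."""
    kind, payload = operand
    if kind == "imm":
        return payload & MASK64
    return registers[payload] & MASK64


def split64(value):
    """Split a 64-bit value into (low, high) 32-bit halves."""
    return value & MASK32, (value >> 32) & MASK32


def lower_add_u64(lhs, rhs, registers):
    """Add two 64-bit operands as a low add followed by an add-with-carry."""
    a_lo, a_hi = split64(read_operand(lhs, registers))
    b_lo, b_hi = split64(read_operand(rhs, registers))
    lo_sum = a_lo + b_lo
    carry = lo_sum >> 32
    hi_sum = (a_hi + b_hi + carry) & MASK32
    return (hi_sum << 32) | (lo_sum & MASK32)
```

The point of the sketch is that the half-splitting step has to work for an immediate as well as for a register pair.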
2024-03-25 09:23:40 +01:00
Evgenii Kudriashov
d365a45cb3
[GlobalISel] Introduce G_TRAP, G_DEBUGTRAP, G_UBSANTRAP (#84941)
Here we introduce three new GMIR instructions to cover a set of trap
intrinsics. The idea behind it is that generic intrinsics shouldn't be
used with G_INTRINSIC opcode.

These new instructions can match perfectly with existing trap ISD nodes.
It allows X86, AArch64, RISCV and Mips to reuse SelectionDAG patterns for
selection and avoid manual selection. However, AMDGPU is an exception: it
selects traps during legalization regardless of whether SelectionDAG or GlobalISel is used.

Since there are not many places where traps are used, this change
attempts to clean up all the usages of G_INTRINSIC with trap intrinsics. So,
there is no stage when both G_TRAP and
G_INTRINSIC_W_SIDE_EFFECTS(@llvm.trap) are allowed.
2024-03-23 13:12:44 +01:00
Pravin Jagtap
e1a8120a63
[AMDGPU] Support double type in atomic optimizer. (#84307)
Presently the atomic optimizer supports only 32-bit operations. The plan
is to extend it to 64-bit operations for compute and graphics. This patch
adds support for the double type for `uniform values` only; support for
divergent values will follow. Supporting divergent values requires
extending/legalizing readfirstlane, readlane, writelane, etc. ops for
64-bit operations to avoid the `bitcast` noise that we have currently.

---------

Authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-03-22 09:25:06 +05:30
paperchalice
a2dfc9ac7d
[NewPM][AMDGPU] Add AMDGPUPassRegistry.def (#86095)
Move the pass registry to a separate file, in preparation for porting dag-isel.
2024-03-22 08:49:29 +08:00
Jonas Paulsson
7564566779 Reapply "Move assertion for AdjustsStack from PEI to MachineVerifier (#85698)"
- The check is now actually done in both PEI and the MachineVerifier.
- More .mir tests trivially updated with "adjustsStack: true" as needed.
2024-03-21 20:24:57 -04:00
SahilPatidar
3ac243bc0d
Update amdgpu_gfx functions to use s0-s3 for inreg SGPR arguments on targets using scratch instructions for stack #78226 (#81394)
Resolve #78226
2024-03-21 16:52:08 +05:30
Pierre van Houtryve
95a834a16c
(Reland) [AMDGPU] Run LowerLDS at the end of the fullLTO pipeline (#85626)
Reland of #75333
2024-03-21 11:44:47 +01:00
Pierre van Houtryve
ccb3a8feaa
[AMDGPU][LowerModuleLDS] Refactor partially lowered module detection (#85793)
Refactor the logic that checks if a module contains mixed
absolute/non-lowered LDS GVs.

The check now happens later when the "worklists" are formed. This is
because in some cases (OpenMP) we can have non-lowered GVs in a lowered
module, and this is normal because those GVs are just unused and removed
from the list at some point before the end of `getUsesOfLDSByFunction`.

Doing the check later ensures that if a mixed module is spotted, then
it's a _real_ mixed module that needs rejection, not a module containing
an intentionally ignored GV.
2024-03-21 11:28:35 +01:00
Matt Arsenault
b6b703b2df
AMDGPU: Infer no-agpr usage in AMDGPUAttributor (#85948)
SIMachineFunctionInfo has a scan of the function body for inline asm
which may use AGPRs, or callees in SIMachineFunctionInfo. Move this
into the attributor, so it actually works interprocedurally.
    
Could probably avoid most of the test churn if this bothered to avoid
adding this on subtargets without AGPRs. We should also probably
try to delete the MIR scan in usesAGPRs but it seems to be trickier
to eliminate.
2024-03-21 14:24:06 +05:30
Thorsten Schütt
deefe3fbc9
[GlobalIsel] Post-review combine ADDO (#85961)
https://github.com/llvm/llvm-project/pull/82927
2024-03-21 03:56:40 +01:00
Jonas Paulsson
9ebd329ad8 Revert "Move assertion for AdjustsStack from PEI to MachineVerifier. (#85698)"
This reverts commit 05bde30585710a51592eee0a6cf6df8184d09c92.

Reverting due to verifier complaints with expensive checks on build-bot.
2024-03-20 11:48:30 -04:00
Jonas Paulsson
05bde30585
Move assertion for AdjustsStack from PEI to MachineVerifier. (#85698)
Have the verifier report a missing AdjustsStack flag rather than waiting until
PEI asserts.
2024-03-20 10:29:12 -04:00
Pravin Jagtap
e52a687871
[AMDGPU][NFC] Test clean up (#85922)
Authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-03-20 17:29:42 +05:30
Pravin Jagtap
070d1e8321
[AMDGPU] Add test for fpext & fptrunc with bf16. (#85909)
Authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-03-20 14:45:38 +05:30
Peter Rong
4a026b5092
[AMDGCN] Use ZExt when handling indices in insertment element (#85718)
When i1 true is used as an index, SExt extends it to i32 -1. This would
cause BitVector to overflow.
The language reference specifies that the index shall be treated as an
unsigned number; this patch fixes that.
(https://llvm.org/docs/LangRef.html#insertelement-instruction)
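
The difference between the two extensions of an i1 index is easy to see with plain integers. The helper names and the two-element vector are illustrative.

```python
def sext_i1_to_i32(bit):
    """Sign-extend an i1: true becomes all ones, i.e. i32 -1."""
    return 0xFFFFFFFF if bit else 0


def zext_i1_to_i32(bit):
    """Zero-extend an i1: true becomes 1."""
    return 1 if bit else 0


def index_in_range(index, num_elements):
    """Check an unsigned index against the vector length."""
    return 0 <= index < num_elements
```

For a two-element vector, index `i1 true` zero-extends to 1, which is in range, while the sign-extended value is 4294967295 when read as unsigned and falls far outside it.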

This patch fixes #85717

---------

Signed-off-by: Peter Rong <PeterRong96@gmail.com>
2024-03-19 21:44:08 -07:00
Changpeng Fang
ab76052fa9
AMDGPU: Treat SWMMAC the same as MFMA and other WMMA for sched_barrier (#85721) 2024-03-19 09:58:09 -07:00
Pravin Jagtap
08701e35ed
[AMDGPU][NFC] Test clean up. (#85775)
Added common check for DPP and Iterative strategies for uniform value
case since optimization applied is same.

Authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-03-19 18:00:34 +05:30
Pierre van Houtryve
953c13b5c9
[AMDGPU][PromoteAlloca] Whole-function alloca promotion to vector (#84735)
Update PromoteAllocaToVector so it considers the whole function before promoting allocas.
Allocas are scored & sorted so the highest value ones are seen first. The budget is now per function instead of per alloca.
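
A toy model of the selection order, assuming each alloca has a score and a cost charged against one function-wide budget. The tuple layout, the names, and the choice to skip (rather than stop at) an alloca that does not fit are assumptions for illustration; the commit message does not specify them.

```python
def select_allocas(allocas, budget):
    """Pick allocas to promote, highest score first, under one shared budget.

    allocas: list of (name, score, cost) tuples
    budget: total cost allowed for the whole function
    """
    promoted = []
    for name, score, cost in sorted(allocas, key=lambda a: a[1], reverse=True):
        if cost <= budget:
            promoted.append(name)
            budget -= cost
    return promoted
```

With a per-alloca budget the visiting order would not matter; with a shared budget it decides which allocas get promoted when not all of them fit.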

Passed internal performance testing.
2024-03-19 11:49:22 +01:00
Jonas Paulsson
09bc6abba6
[MachineFrameInfo] Refactoring around computeMaxCallFrameSize() (NFC) (#78001)
- Use computeMaxCallFrameSize() in PEI::calculateCallFrameInfo() instead of duplicating the code.

- Set AdjustsStack in FinalizeISel instead of in computeMaxCallFrameSize().
2024-03-18 10:37:59 -04:00
Yingwei Zheng
38a44bdc93
[CodeGenPrepare] Reverse the canonicalization of isInf/isNanOrInf (#81572)
In commit
2b582440c1,
we canonicalize the isInf/isNanOrInf idiom into fabs+fcmp for better
analysis/codegen (See also the discussion in
https://github.com/llvm/llvm-project/pull/76338).

This patch reverses the fabs+fcmp to `is.fpclass`. If the `is.fpclass`
is not supported by the target, it will be expanded by TLI.

Fixes the regression introduced by
2b582440c1
and
https://github.com/llvm/llvm-project/pull/80414#issuecomment-1936374206.
2024-03-18 18:27:45 +08:00
pvanhout
3493438605 Revert "[AMDGPU] Run LowerLDS at the end of the fullLTO pipeline (#75333)"
This reverts commit 9b98692eedb78aa106539c36ba02944f32cae1ff.
2024-03-18 11:18:57 +01:00
Pierre van Houtryve
9b98692eed
[AMDGPU] Run LowerLDS at the end of the fullLTO pipeline (#75333)
This change allows us to use `--lto-partitions` in some cases (not at
all guaranteed it works perfectly), as LDS is lowered before the module
is split for parallel codegen.

We must run LowerLDS before splitting modules as it needs to see all
callers of functions with LDS to properly lower them.
2024-03-18 09:09:43 +01:00
Sameer Sahasrabuddhe
ec34699f75
[GlobalISel] convergence control tokens and intrinsics (#67006)
[GlobalISel] Implement convergence control tokens and intrinsics in GMIR

In the IR translator, convert the LLVM token type to LLT::token(), which is an
alias for the s0 type. These show up as implicit uses on convergent operations.

Differential Revision: https://reviews.llvm.org/D158147
2024-03-18 10:34:11 +05:30