6628 Commits

Author SHA1 Message Date
Matt Arsenault
54bda79335 AMDGPU: Simplify and improve sincos matching
The first trivial example I tried failed to merge due to the user scan
logic. Remove the complicated user-scan handling, along with its
distance thresholds and same-block restriction. The actual expansion of
sincos is basically the same size as sin or cos individually. Copy the
technique the generic optimization uses, which is to just use the
input instruction as the insert point or just insert at the start of
the entry block.

https://reviews.llvm.org/D156706
2023-08-02 17:48:35 -04:00
Matt Arsenault
b953155b49 AMDGPU: Fix counting debug instructions in execz skip threshold 2023-08-02 08:09:41 -04:00
Mirko Brkusanin
acdc503d6c [AMDGPU][GlobalISel] Update applyMappingImpl for G_ABS and type v2s16
For G_ABS with type v2s16 and sgpr inputs break down into two s32 G_ABS
instructions.

Patch by: Acim Maravic

Differential Revision: https://reviews.llvm.org/D155867
2023-08-02 12:27:06 +02:00
Mirko Brkusanin
fadf3e7f2b [AMDGPU][GlobalISel] Update legalizer for G_ABS, G_SMIN, G_SMAX, G_UMIN, G_UMAX
There is no need to increase the size of odd sized vectors if they are
going to be scalarized by a different rule.

Patch by: Acim Maravic

Differential Revision: https://reviews.llvm.org/D155865
2023-08-02 12:18:18 +02:00
Jay Foad
c2093b8504 [AMDGPU] Add target features for GDS and GWS
GFX9 subtargets from GFX90A onwards lack GDS but still have GWS.

Differential Revision: https://reviews.llvm.org/D156713
2023-08-02 09:02:07 +01:00
Matt Arsenault
5dfdd3494b AMDGPU: Don't try to fold wavefrontsize intrinsic in libcall simplify
It's not a libcall so doesn't really belong here to begin
with. Relying on checking the target name and explicit features isn't
particularly sound either. The library doesn't use the intrinsic
anymore, so it doesn't matter anyway.
2023-08-01 18:20:50 -04:00
Matt Arsenault
eb00555c16 AMDGPU: Add more tests for sincos recognition
These show both broken cases and cases which are handled too
conservatively.
2023-08-01 18:20:50 -04:00
Matt Arsenault
4d42e8b5d1 Reapply "[CodeGen]Allow targets to use target specific COPY instructions for live range splitting"
This reverts commit a496c8be6e638ae58bb45f13113dbe3a4b7b23fd.

The workaround in c26dfc81e254c78dc23579cf3d1336f77249e1f6 should work
around the underlying problem with SUBREG_TO_REG.
2023-07-31 20:15:45 -04:00
Matt Arsenault
5b5bd81b71 AMDGPU: Move placement of RemoveIncompatibleFunctions
This should be approximately first and run with other module passes.

https://reviews.llvm.org/D155987
2023-07-31 19:22:04 -04:00
Matt Arsenault
db4d6ef9ef AMDGPU: Directly emit fabs intrinsic instead of new libcall 2023-07-31 19:19:56 -04:00
Matt Arsenault
02a0b11331 AMDGPU: Remove weird usage of implicit operand on COPY
For the purpose of the test it works as well to have a use after the
copy itself.
2023-07-31 19:16:11 -04:00
Matt Arsenault
0aa439d502 AMDGPU/GlobalISel: Use SGPR results for G_AMDGPU_WAVE_ADDRESS 2023-07-31 19:16:11 -04:00
Matt Arsenault
8a677a7ff0 AMDGPU: Partially respect nobuiltin in libcall simplifier
There are more contexts where it's not handled correctly but this is
the simplest one.

https://reviews.llvm.org/D156682
2023-07-31 10:56:46 -04:00
Sameer Sahasrabuddhe
d9847cde48 [GlobalISel] convergent intrinsics
Introduced the convergent equivalent of the existing G_INTRINSIC opcodes:

- G_INTRINSIC_CONVERGENT
- G_INTRINSIC_CONVERGENT_W_SIDE_EFFECTS

Out of the targets that currently have some support for GlobalISel, the patch
assumes that the convergent intrinsics are only relevant to SPIRV and AMDGPU.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D154766
2023-07-31 12:15:39 +05:30
Jay Foad
e2e3f06813 Revert "[MachineScheduler] Track physical register dependencies per-regunit"
This reverts commit 1a54671d5405a39de362e9692ce963c0638023bc.

It was causing lit test failures in an LLVM_ENABLE_EXPENSIVE_CHECKS
build.
2023-07-29 18:05:25 +01:00
Jay Foad
1a54671d54 [MachineScheduler] Track physical register dependencies per-regunit
Change the scheduler's physical register dependency tracking from
registers-and-their-aliases to regunits. This has a couple of advantages
when subregisters are used:

- The dependency tracking is more accurate and creates fewer useless
  edges in the dependency graph. An AMDGPU example, edited for clarity:

    SU(0): $vgpr1 = V_MOV_B32 $sgpr0
    SU(1): $vgpr1 = V_ADDC_U32 0, $vgpr1
    SU(2): $vgpr0_vgpr1 = FLAT_LOAD_DWORDX2 $vgpr0_vgpr1, 0, 0

  There is a data dependency on $vgpr1 from SU(0) to SU(1) and from
  SU(1) to SU(2). But the old dependency tracking code also added a
  useless edge from SU(0) to SU(2) because it thought that SU(0)'s def
  of $vgpr1 aliased with SU(2)'s use of $vgpr0_vgpr1.

- On targets like AMDGPU that make heavy use of subregisters, each
  register can have a huge number of aliases - it can be quadratic in
  the size of the largest defined register tuple. There is a much lower
  bound on the number of regunits per register, so iterating over
  regunits is faster than iterating over aliases.

The LLVM compile-time tracker shows a tiny overall improvement of 0.03%
on X86. I expect a larger compile-time improvement on targets like
AMDGPU.
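
A minimal sketch of the per-regunit idea, applied to the three-instruction
example above. This is not the scheduler's code: the REGUNITS table and the
build_edges helper are hypothetical, and only true data dependencies are
modelled (the real scheduler also tracks output and anti dependencies).

```python
# Hypothetical model: each register name maps to the regunits it covers.
REGUNITS = {
    "sgpr0": {"s0"},
    "vgpr0": {"v0"},
    "vgpr1": {"v1"},
    "vgpr0_vgpr1": {"v0", "v1"},
}

def build_edges(instrs):
    """Return data-dependency edges (def SU, use SU), tracked per regunit.

    instrs is a list of (defs, uses) pairs of register names.
    """
    last_def = {}  # regunit -> index of the SU that most recently defined it
    edges = set()
    for i, (defs, uses) in enumerate(instrs):
        # A use depends only on the latest def of each regunit it reads.
        for reg in uses:
            for unit in REGUNITS[reg]:
                if unit in last_def:
                    edges.add((last_def[unit], i))
        # Record this SU as the latest def of every regunit it writes.
        for reg in defs:
            for unit in REGUNITS[reg]:
                last_def[unit] = i
    return edges

# SU(0), SU(1), SU(2) from the example, as (defs, uses).
SCHEDULE = [
    (["vgpr1"], ["sgpr0"]),
    (["vgpr1"], ["vgpr1"]),
    (["vgpr0_vgpr1"], ["vgpr0_vgpr1"]),
]

print(sorted(build_edges(SCHEDULE)))
```

Because SU(1) redefines the only regunit that SU(0) wrote, SU(2) picks up an
edge from SU(1) alone, and no SU(0) to SU(2) edge is created.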

Differential Revision: https://reviews.llvm.org/D156552
2023-07-29 15:34:53 +01:00
Jay Foad
5a64c89c8d [MachineScheduler] Test case for physical register dependencies
Differential Revision: https://reviews.llvm.org/D156551
2023-07-29 15:34:53 +01:00
Matt Arsenault
3240ae7034 AMDGPU/GlobalISel: Set dead on scc on manually selected instructions
In SelectionDAG InstrEmitter automatically puts dead flags on unused
physreg defs everywhere. The generated selectors should also set dead
on physreg defs that were not used in the pattern.
2023-07-28 14:14:06 -04:00
Jeffrey Byrnes
391249d1af [AMDGPU] Allow 8,16 bit sources in calculateSrcByte
This is required for many trees produced in practice for i8 CodeGen.

Differential Revision: https://reviews.llvm.org/D155864

Change-Id: Iac01d183d9998b15138bdc7a5051e3bed338e7d9
2023-07-28 09:50:21 -07:00
Matt Arsenault
95e5a461f5 AMDGPU: Always custom lower extract_subvector
The patterns were ripped out in
a4a3ac10cb1a40ccebed4e81cd7e94f1eb71602d so this always needs to be
custom lowered. I absolutely hate how difficult it is to write tests
for these, I have no doubt there are more of these hidden.

Fixes #64142
2023-07-27 08:46:44 -04:00
Vitaly Buka
a496c8be6e Revert "[CodeGen]Allow targets to use target specific COPY instructions for live range splitting"
And dependent commits.

Details in D150388.

This reverts commit 825b7f0ca5f2211ec3c93139f98d1e24048c225c.
This reverts commit 7a98f084c4d121244ef7286bc6503b6a181d446e.
This reverts commit b4a62b1fa546312d882fa12dfdcd015177d66826.
This reverts commit b7836d856206ec39509d42529f958c920368166b.

No conflicts in the code, few tests had conflicts in autogenerated CHECKs:
llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll

Reviewed By: alexfh

Differential Revision: https://reviews.llvm.org/D156381
2023-07-26 22:13:32 -07:00
Pravin Jagtap
1462053608 [AMDGPU] Propagate constants for llvm.amdgcn.wave.reduce.umin/umax
Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D156077
2023-07-26 23:46:01 -04:00
pvanhout
a8aabba587 [AMDGPU] Fix PromoteAlloca Subvector Stores for Single Elements
The previous condition was incorrect in some cases, like storing <2 x i32>
into a double. If IndexVal was >0, we ended up never storing anything.

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D156308
2023-07-26 13:21:21 +02:00
pvanhout
6a767fbc36 [AMDGPU] Precommit tests for D156308
Also includes another testcase that's unrelated, it's just a sanity check.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D156309
2023-07-26 13:21:20 +02:00
Corbin Robeck
7a4968b5a3 [AMDGPU] Add dynamic stack bit info to kernel-resource-usage Rpass output
In code object 5 (https://llvm.org/docs/AMDGPUUsage.html#code-object-v5-metadata) the AMDGPU backend added the .uses_dynamic_stack bit to the kernel metadata to identify kernels whose stack usage cannot be determined at compile time (mainly indirect function calls and recursion). This patch adds this information to the output of the kernel-resource-usage remarks.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D156040

Author:    Corbin Robeck <corbin.robeck@amd.com>
2023-07-25 12:20:13 -07:00
Kevin P. Neal
76c22b18ea [FPEnv][AMDGPU] Correct strictfp tests.
Correct AMDGPU strictfp tests to follow the rules documented in the LangRef:
https://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics

Mostly these tests just needed the strictfp attribute on function
definitions.  I've also removed the strictfp attribute from uses
of the constrained intrinsics because it comes by default since
D154991, but I only did this in tests I was changing anyway.

I also removed attributes added to declare lines of intrinsics. The
attributes of intrinsics cannot be changed in a test so I eliminated
attempts to do so.

Test changes verified with D146845.
2023-07-25 13:24:46 -04:00
Matt Arsenault
e3fd8f83a8 AMDGPU: Correctly expand f64 sqrt intrinsic
rocm-device-libs and llpc were avoiding using f64 sqrt
intrinsics in favor of their own expansions. Port the
expansion into the backend. Both of these users should be
updated to call the intrinsic instead.

The library and llpc expansions are slightly different.
llpc uses an ldexp to do the scale; the library uses a multiply.

Use ldexp to do the scale instead of the multiply.
I believe v_ldexp_f64 and v_mul_f64 are always the same number of
cycles, but it's cheaper to materialize the 32-bit integer constant
than the 64-bit double constant.
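
A rough sketch of the scaling idea only, not the backend expansion: it relies
on sqrt(x * 2^(2k)) == sqrt(x) * 2^k, so a tiny input can be scaled up with
ldexp before the square root and the result scaled back down with ldexp
afterwards. The threshold and scale amount below are illustrative assumptions,
the function name is hypothetical, and the refinement steps of the real
lowering are not shown.

```python
import math

def scaled_sqrt(x, threshold=2.0 ** -767, up=256):
    """Square root with ldexp-based input and output scaling.

    threshold and up are assumed values chosen for illustration; up must be
    even so that the result can be scaled back by exactly up / 2.
    """
    needs_scale = 0.0 < x < threshold
    # Scale small inputs up by 2**up; only the integer exponent is needed.
    scaled = math.ldexp(x, up) if needs_scale else x
    root = math.sqrt(scaled)
    # Undo the scaling: sqrt(2**up) == 2**(up // 2).
    return math.ldexp(root, -(up // 2)) if needs_scale else root
```

The scale is expressed as a small integer exponent in both directions, which
is the property the message above points to when comparing ldexp against a
multiply by a 64-bit constant.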

The libraries have another fast version of sqrt which will
be handled separately.

I am tempted to do this in an IR expansion instead. In the IR
we could take advantage of computeKnownFPClass to avoid
the 0-or-inf argument check.
2023-07-25 07:54:11 -04:00
Matt Arsenault
47b3ada432 AMDGPU: Add more sqrt f64 lowering tests
Almost all permutations of the flags are potentially relevant.
2023-07-25 07:54:11 -04:00
pvanhout
3cd4afce5b [AMDGPU] Allow vector access types in PromoteAllocaToVector
Depends on D152706
Solves SWDEV-408279

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D155699
2023-07-25 07:44:48 +02:00
pvanhout
3890a3b113 [AMDGPU] Use SSAUpdater in PromoteAlloca
This allows PromoteAlloca to not be reliant on a second SROA run to remove the alloca completely. It just does the full transformation directly.

Note PromoteAlloca is still reliant on SROA running first to
canonicalize the IR. For instance, PromoteAlloca will no longer handle aggregate types because those should be simplified by SROA before reaching the pass.

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D152706
2023-07-25 07:44:47 +02:00
Matt Arsenault
0d797b71eb RegisterCoalescer: Fix empty subrange verifier error
In this example an implicit def had live-out undef subrange
defs. After coalescing with the def from a previous block, the
undef-defed lanes are no longer live out of the block in the new
interval. An empty subrange was tentatively created for these lanes,
but it must be deleted.
2023-07-24 12:18:34 -04:00
Matt Arsenault
2a53b6c06b RegisterCoalescer: Fix verifier error on redef of subregister for live out implicit_defs
A live out implicit_def wasn't deleted, but the subranges weren't
correctly updated. The main range was correct but the def
corresponding to the initial main range def instruction was missing
from the lanes redefined in another block.

The written lanes are not quite the same as the valid lanes in the
case of an implicit_def.

Fixes a verifier error in blender. There is an additional verifier error
in some of the testcase variants where an empty subrange remains.
2023-07-24 12:18:34 -04:00
Matt Arsenault
e561e7cb48 AMDGPU: Implement combineRepeatedFPDivisors 2023-07-24 11:19:36 -04:00
Pravin Jagtap
d163b76ce3 [AMDGPU] Fix llvm.amdgcn.wave.reduce.umax/umin MIR tests
Fixes the MIR tests reported in https://lab.llvm.org/buildbot/#/builders/16/builds/51955

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D156125
2023-07-24 10:19:37 -04:00
Pravin Jagtap
c48ed93cf8 [AMDGPU] Add llvm.amdgcn.wave.reduce.umin/umax Intrinsic.
When the input to the intrinsic is a uniform value, the reduced value is
the same as the input. If the input is divergent, we need to iterate
over all active lanes of the wavefront to perform the reduction.

The control flow for a loop has been set up, which iterates over only
the active lanes to perform the reduction.
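
A small model of that loop, assuming the active lanes are given as a bitmask
and the per-lane values as a list. The function name and arguments are
hypothetical; this only simulates the reduction and does not reflect the
generated machine code.

```python
def wave_reduce_umin(values, exec_mask):
    """Unsigned-min reduction over the lanes whose bit is set in exec_mask."""
    acc = 0xFFFFFFFF  # identity for a 32-bit unsigned min
    remaining = exec_mask
    while remaining:
        # Index of the lowest active lane that has not been visited yet.
        lane = (remaining & -remaining).bit_length() - 1
        acc = min(acc, values[lane])
        # Clear that lane's bit so the loop ends once every active lane is done.
        remaining &= remaining - 1
    return acc
```

Inactive lanes never contribute, and when every active lane holds the same
value the loop returns that value, matching the uniform case described above.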

Introduced the WAVE_REDUCE_UMIN_PSEUDO_U32 and
WAVE_REDUCE_UMAX_PSEUDO_U32 pseudos, which
are lowered post-ISel (in `EmitInstrWithCustomInserter`).

Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D154858
2023-07-24 00:06:00 -04:00
Matt Arsenault
8406c3568a AMDGPU: Implement new 2ulp fdiv lowering
Extends the new frexp scaled reciprocal to the general case. The
reciprocal case is just the same thing when frexp of 1 is constant
folded. Could probably clean up the code to rely on that constant
folding.
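
A sketch of the frexp scaling idea in double precision, for illustration
only: both operands are split into mantissa and exponent, the reciprocal is
taken on the mantissa of the denominator, and the exponents are recombined
with ldexp. The function name is hypothetical, the hardware rcp is stood in
for by a plain division, and zero, infinity and NaN inputs are not handled.

```python
import math

def fdiv_via_frexp(a, b):
    """Approximate a / b by scaling both operands with frexp."""
    mant_a, exp_a = math.frexp(a)  # a == mant_a * 2**exp_a, 0.5 <= |mant_a| < 1
    mant_b, exp_b = math.frexp(b)
    # The reciprocal only ever sees a mantissa in [0.5, 1), so it cannot hit
    # the denormal range regardless of how small or large b is.
    rcp = 1.0 / mant_b
    return math.ldexp(mant_a * rcp, exp_a - exp_b)
```

With a == 1.0, frexp yields the constants (0.5, 1), so the numerator side
folds away and what remains is the scaled reciprocal.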

Improves results for the IEEE path for the default OpenCL division. We
used to only emit the fdiv.fast intrinsic with a 2.5 ulp accuracy
threshold with DAZ, which uses explicit range checks. This gives us a
better fast option with the default IEEE behavior.
2023-07-21 18:55:42 -04:00
Matt Arsenault
6699c37028 AMDGPU: Refactor AMDGPUCodeGenPrepare fdiv handling
NFC-ish. Does trigger some reordering of the fdiv scalarization. Also
skips scalarizing in more cases where nothing was going to happen. We
can still scalarize in some no-op edge cases.

https://reviews.llvm.org/D155740
2023-07-21 18:55:42 -04:00
Matt Arsenault
8287f3af9d AMDGPU: Overhaul and improve rcp and rsq f32 formation
The highlight change is a new denormal safe 1ulp lowering which uses
rcp after using frexp to perform input scaling. This saves 2
instructions compared to other implementations which performed an
explicit denormal range check. This improves the OpenCL default, and
requires a flag for HIP. I don't believe there's any flag wired up for
OpenMP to emit the necessary fpmath metadata.

This provides several improvements and changes that were hard to
separate without regressing one case or another. Disturbingly the
OpenCL conformance test seems to have the reciprocal test commented
out. I locally hacked it back in to test this.

Starts introducing f32 rsq intrinsics in AMDGPUCodeGenPrepare. Like
the rcp case, we could do this in codegen if !fpmath were preserved
(although we would lose some computeKnownFPClass tricks). Start
requiring contract flags to form rsq. The rsq fusion actually improves
the result from ~2ulp to ~1ulp. We have some older fusion in codegen
which only keys off unsafe math which should be refined.

Expand rsq patterns by checking for denormal inputs and pre/post
multiplying like the current library code does. We also take advantage
of computeKnownFPClass to avoid the scaling when we can statically
prove the input cannot be a denormal. We could do the same for the rcp
case, but unlike rsq a large input can underflow to denormal. We need
additional upper bound exponent checks on the input in order to do the
same for rcp.

This rsq handling also now starts handling the negated case. We
introduce rsq with an fneg. In the case the fneg doesn't fold into its
user, it's a neutral change but provides improvement if it is foldable
as a source modifier.

Also starts respecting the arcp attribute properly, and more strictly
interprets afn. We were previously interpreting afn as implying you
could do the reciprocal expansion of an fdiv. The codegen handling of
these also needs to be revisited.

This also effectively introduces the optimization
combineRepeatedFPDivisors enables, just done in the IR instead (and
only for f32).
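
For reference, the repeated-divisor rewrite amounts to the following, shown
here on plain Python floats. The helper name is hypothetical, and the rewrite
changes rounding in general, so it is only valid when the fast-math flags
permit it.

```python
def divide_all(numerators, divisor):
    """Replace several divisions by one reciprocal and several multiplies."""
    rcp = 1.0 / divisor  # computed once and shared by every use
    return [n * rcp for n in numerators]
```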

This is almost across the board better. The one minor regression is
for gfx6/buggy frexp case where for multiple reciprocals, we could
previously reuse rematerialized constants per instance (it's neutral
for a single rcp).

The fdiv.fast and sqrt handling need to be revisited next.

https://reviews.llvm.org/D155593
2023-07-21 16:35:53 -04:00
Matt Arsenault
37512d7629 AMDGPU: Add baseline test for fdiv combine 2023-07-21 16:04:12 -04:00
Jay Foad
e45a0c2994 [AMDGPU][RFC] Update isLegalAddressingMode for GFX9 SMEM signed offsets
Differential Revision: https://reviews.llvm.org/D155587
2023-07-21 10:56:43 +01:00
Jay Foad
787bef0bee [AMDGPU] Add tests for SMEM addressing modes in CodeGenPrepare
Differential Revision: https://reviews.llvm.org/D155854
2023-07-21 10:56:43 +01:00
Matt Arsenault
d33ab05467 AMDGPU: Add flag to disable fdiv processing in IR pass
We kind of have to have multiple implementations of fdiv split between
the two selectors with some pre-processing. Add yet another test to
check for consistency of interpretation of flag combinations. We have
quite a bit of test redundancy here already, but there are so many
possible interesting permutations it's unwieldy to cover every detail
in any one of them. We have a number of overlapping fdiv tests but
it's hard to follow everything going on as it is.
2023-07-20 19:51:15 -04:00
Matt Arsenault
b2d58b596c AMDGPU: Expand rsq testing to cover contract flag
The 1.0/sqrt(x) -> rsq(x) fold increases precision and probably needs
a contract flag.
2023-07-20 19:51:15 -04:00
Matt Arsenault
fb54afd1b7 AMDGPU: Fold fsub [+-0] into fneg when folding source modifiers
This isn't always folded to fneg for a freestanding fsub depending on
the denormal mode. When matching source modifiers, we're implicitly
canonicalizing the input so we can fold it here.

Doesn't bother handling the VOP3P case since it's only relevant with
DAZ, which nobody really uses with f16.

For f64, tests show an existing bug where DAGCombiner tries to respect
the denormal mode for fsub -0, x, but not after it's lowered to fadd
-0, (fneg x). Either the fold is wrong or we shouldn't restrict the
fsub case based on the denormal mode.

https://reviews.llvm.org/D155652
2023-07-20 19:29:40 -04:00
Matt Arsenault
881e9f2934 AMDGPU: Regenerate test checks
Mostly a workaround for recent reverts in update_test_checks
2023-07-20 19:26:35 -04:00
Matt Arsenault
ca34f1bdcd AMDGPU: Add baseline test for folding fsub into fneg modifiers 2023-07-20 18:29:35 -04:00
Matt Arsenault
0295513238 AMDGPU: Filter out contract flags when lowering exp
It is unsafe to contract the fsub into the fmul. It also increases
code size by duplicating a constant.
2023-07-20 18:14:24 -04:00
Matt Arsenault
076bc374fc AMDGPU: Add some new baseline tests for exp lowering 2023-07-20 18:14:24 -04:00
Jingu Kang
351b4c17dd Revert "[MachineLICM] Handle Subloops"
This reverts commit 50dd383d08670960540fecb4b48c0f0429fbfba3.
2023-07-20 17:12:25 +01:00
Jingu Kang
50dd383d08 [MachineLICM] Handle Subloops
Following discussion on https://reviews.llvm.org/D154205, make the MachineLICM pass
handle subloops by visiting the outermost loop's blocks only once.

Differential Revision: https://reviews.llvm.org/D154205
2023-07-20 16:39:13 +01:00