6615 Commits

Author SHA1 Message Date
Sameer Sahasrabuddhe
d9847cde48 [GlobalISel] convergent intrinsics
Introduced the convergent equivalent of the existing G_INTRINSIC opcodes:

- G_INTRINSIC_CONVERGENT
- G_INTRINSIC_CONVERGENT_W_SIDE_EFFECTS

Out of the targets that currently have some support for GlobalISel, the patch
assumes that the convergent intrinsics only relevant to SPIRV and AMDGPU.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D154766
2023-07-31 12:15:39 +05:30
Jay Foad
e2e3f06813 Revert "[MachineScheduler] Track physical register dependencies per-regunit"
This reverts commit 1a54671d5405a39de362e9692ce963c0638023bc.

It was causing lit test failures in a LLVM_ENABLE_EXPENSIVE_CHECKS
build.
2023-07-29 18:05:25 +01:00
Jay Foad
1a54671d54 [MachineScheduler] Track physical register dependencies per-regunit
Change the scheduler's physical register dependency tracking from
registers-and-their-aliases to regunits. This has a couple of advantages
when subregisters are used:

- The dependency tracking is more accurate and creates fewer useless
  edges in the dependency graph. An AMDGPU example, edited for clarity:

    SU(0): $vgpr1 = V_MOV_B32 $sgpr0
    SU(1): $vgpr1 = V_ADDC_U32 0, $vgpr1
    SU(2): $vgpr0_vgpr1 = FLAT_LOAD_DWORDX2 $vgpr0_vgpr1, 0, 0

  There is a data dependency on $vgpr1 from SU(0) to SU(1) and from
  SU(1) to SU(2). But the old dependency tracking code also added a
  useless edge from SU(0) to SU(2) because it thought that SU(0)'s def
  of $vgpr1 aliased with SU(2)'s use of $vgpr0_vgpr1.

- On targets like AMDGPU that make heavy use of subregisters, each
  register can have a huge number of aliases - it can be quadratic in
  the size of the largest defined register tuple. There is a much lower
  bound on the number of regunits per register, so iterating over
  regunits is faster than iterating over aliases.

The LLVM compile-time tracker shows a tiny overall improvement of 0.03%
on X86. I expect a larger compile-time improvement on targets like
AMDGPU.

Differential Revision: https://reviews.llvm.org/D156552
2023-07-29 15:34:53 +01:00
Jay Foad
5a64c89c8d [MachineScheduler] Test case for physical register dependencies
Differential Revision: https://reviews.llvm.org/D156551
2023-07-29 15:34:53 +01:00
Matt Arsenault
3240ae7034 AMDGPU/GlobalISel: Set dead on scc on manually selected instructions
In SelectionDAG InstrEmitter automatically puts dead flags on unused
physreg defs everywhere. The generated selectors should also set dead
on physreg defs that were not used in the pattern.
2023-07-28 14:14:06 -04:00
Jeffrey Byrnes
391249d1af [AMDGPU] Allow 8,16 bit sources in calculateSrcByte
This is required for many trees produced in practice for i8 CodeGen.

Differential Revision: https://reviews.llvm.org/D155864

Change-Id: Iac01d183d9998b15138bdc7a5051e3bed338e7d9
2023-07-28 09:50:21 -07:00
Matt Arsenault
95e5a461f5 AMDGPU: Always custom lower extract_subvector
The patterns were ripped out in
a4a3ac10cb1a40ccebed4e81cd7e94f1eb71602d so this always needs to be
custom lowered. I absolutely hate how difficult it is to write tests
for these, I have no doubt there are more of these hidden.

Fixes #64142
2023-07-27 08:46:44 -04:00
Vitaly Buka
a496c8be6e Revert "[CodeGen]Allow targets to use target specific COPY instructions for live range splitting"
And dependent commits.

Details in D150388.

This reverts commit 825b7f0ca5f2211ec3c93139f98d1e24048c225c.
This reverts commit 7a98f084c4d121244ef7286bc6503b6a181d446e.
This reverts commit b4a62b1fa546312d882fa12dfdcd015177d66826.
This reverts commit b7836d856206ec39509d42529f958c920368166b.

No conflicts in the code, few tests had conflicts in autogenerated CHECKs:
llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll

Reviewed By: alexfh

Differential Revision: https://reviews.llvm.org/D156381
2023-07-26 22:13:32 -07:00
Pravin Jagtap
1462053608 [AMDGPU] Propagate constants for llvm.amdgcn.wave.reduce.umin/umax
Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D156077
2023-07-26 23:46:01 -04:00
pvanhout
a8aabba587 [AMDGPU] Fix PromoteAlloca Subvector Stores for Single Elements
The previous condition was incorrect in some cases, like storing <2 x i32>
into a double. If IndexVal was >0, we ended up never storing anything.

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D156308
2023-07-26 13:21:21 +02:00
pvanhout
6a767fbc36 [AMDGPU] Precommit tests for D156308
Also includes another testcase that's unrelated, it's just a sanity check.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D156309
2023-07-26 13:21:20 +02:00
Corbin Robeck
7a4968b5a3 [AMDGPU] Add dynamic stack bit info to kernel-resource-usage Rpass output
In code object 5 (https://llvm.org/docs/AMDGPUUsage.html#code-object-v5-metadata) the AMDGPU backend added the .uses_dynamic_stack bit to the kernel meta data to identity kernels which have compile time indeterminable stack usage (indirect function calls and recursion mainly). This patch adds this information to the output of the kernel-resource-usage remarks.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D156040

Author:    Corbin Robeck <corbin.robeck@amd.com>
2023-07-25 12:20:13 -07:00
Kevin P. Neal
76c22b18ea [FPEnv][AMDGPU] Correct strictfp tests.
Correct AMDGPU strictfp tests to follow the rules documented in the LangRef:
https://llvm.org/docs/LangRef.html#constrained-floating-point-intrinsics

Mostly these tests just needed the strictfp attribute on function
definitions.  I've also removed the strictfp attribute from uses
of the constrained intrinsics because it comes by default since
D154991, but I only did this in tests I was changing anyway.

I also removed attributes added to declare lines of intrinsics. The
attributes of intrinsics cannot be changed in a test so I eliminated
attempts to do so.

Test changes verified with D146845.
2023-07-25 13:24:46 -04:00
Matt Arsenault
e3fd8f83a8 AMDGPU: Correctly expand f64 sqrt intrinsic
rocm-device-libs and llpc were avoiding using f64 sqrt
intrinsics in favor of their own expansions. Port the
expansion into the backend. Both of these users should be
updated to call the intrinsic instead.

The library and llpc expansions are slightly different.
llpc uses an ldexp to do the scale; the library uses a multiply.

Use ldexp to do the scale instead of the multiply.
I believe v_ldexp_f64 and v_mul_f64 are always the same number of
cycles, but it's cheaper to materialize the 32-bit integer constant
than the 64-bit double constant.

The libraries have another fast version of sqrt which will
be handled separately.

I am tempted to do this in an IR expansion instead. In the IR
we could take advantage of computeKnownFPClass to avoid
the 0-or-inf argument check.
2023-07-25 07:54:11 -04:00
Matt Arsenault
47b3ada432 AMDGPU: Add more sqrt f64 lowering tests
Almost all permutations of the flags are potentially relevant.
2023-07-25 07:54:11 -04:00
pvanhout
3cd4afce5b [AMDGPU] Allow vector access types in PromoteAllocaToVector
Depends on D152706
Solves SWDEV-408279

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D155699
2023-07-25 07:44:48 +02:00
pvanhout
3890a3b113 [AMDGPU] Use SSAUpdater in PromoteAlloca
This allows PromoteAlloca to not be reliant on a second SROA run to remove the alloca completely. It just does the full transformation directly.

Note PromoteAlloca is still reliant on SROA running first to
canonicalize the IR. For instance, PromoteAlloca will no longer handle aggregate types because those should be simplified by SROA before reaching the pass.

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D152706
2023-07-25 07:44:47 +02:00
Matt Arsenault
0d797b71eb RegisterCoaleser: Fix empty subrange verifier error
In this example an implicit def had live-out undef subrange
defs. After coalescing with the def from a previous block, the
undef-defed lanes are no longer live out of the block in the new
interval. An empty subrange was tenatively created for these lanes,
but it must be deleted.
2023-07-24 12:18:34 -04:00
Matt Arsenault
2a53b6c06b RegisterCoalescer: Fix verifier error on redef of subregister for live out implicit_defs
A live out implicit_def wasn't deleted, but the subranges weren't
correctly updated. The main range was correct but the def
corresponding to the initial main range def instruction was missing
from the lanes redefined in another block.

The written lanes are not quite the same as the valid lanes in the
case of an implicit_def.

Fixes verifier error in blender. There is an additional verifier in
some of the testcase variants where an empty subrange remains.
2023-07-24 12:18:34 -04:00
Matt Arsenault
e561e7cb48 AMDGPU: Implement combineRepeatedFPDivisors 2023-07-24 11:19:36 -04:00
Pravin Jagtap
d163b76ce3 [AMDGPU] Fix llvm.amdgcn.wave.reduce.umax/umin MIR tests
Fixes the MIR tests reported in https://lab.llvm.org/buildbot/#/builders/16/builds/51955

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D156125
2023-07-24 10:19:37 -04:00
Pravin Jagtap
c48ed93cf8 [AMDGPU] Add llvm.amdgcn.wave.reduce.umin/umax Intrinsic.
When input to intrinsic is uniform value, reduced value is
same as input whereas if input value is divergent we need
to iterate over all active lanes of WaveFront to perform
the reduction.

The control flow for a `loop` has been set up, which
iterates over `only` active lanes to perform reduction.

Introduced WAVE_REDUCE_UMIN_PSEUDO_U32 and
WAVE_REDUCE_UMAX_PSEUDO_U32 Pseudos which
are lowered Post-ISel (in `EmitInstrWithCustomInserter `).

Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D154858
2023-07-24 00:06:00 -04:00
Matt Arsenault
8406c3568a AMDGPU: Implement new 2ulp fdiv lowering
Extends the new frexp scaled reciprocal to the general case. The
reciprocal case is just the same thing when frexp of 1 is constant
folded. Could probably clean up the code to rely on that constant
folding.

Improves results for the IEEE path for the default OpenCL division. We
used to only emit the fdiv.fast intrinsic with a 2.5 ulp accuracy
threshold with DAZ, which uses explicit range checks. This gives us a
better fast option with the default IEEE behavior.
2023-07-21 18:55:42 -04:00
Matt Arsenault
6699c37028 AMDGPU: Refactor AMDGPUCodeGenPrepare fdiv handling
NFC-ish. Does trigger some reordering of the fdiv scalarization. Also
skips scalarizing in more cases where nothing was going to happen. We
can still scalarize in some no-op edge cases.

https://reviews.llvm.org/D155740
2023-07-21 18:55:42 -04:00
Matt Arsenault
8287f3af9d AMDGPU: Overhaul and improve rcp and rsq f32 formation
The highlight change is a new denormal safe 1ulp lowering which uses
rcp after using frexp to perform input scaling. This saves 2
instructions compared to other implementations which performed an
explicit denormal range change. This improves the OpenCL default, and
requires a flag for HIP. I don't believe there's any flag wired up for
OpenMP to emit the necessary fpmath metadata.

This provides several improvements and changes that were hard to
separate without regressing one case or another. Disturbingly the
OpenCL conformance test seems to have the reciprocal test commented
out. I locally hacked it back in to test this.

Starts introducing f32 rsq intrinsics in AMDGPUCodeGenPrepare. Like
the rcp case, we could do this in codegen if !fpmath were preserved
(although we would lose some computeKnownFPClass tricks). Start
requiring contract flags to form rsq. The rsq fusion actually improves
the result from ~2ulp to ~1ulp. We have some older fusion in codegen
which only keys off unsafe math which should be refined.

Expand rsq patterns by checking for denormal inputs and pre/post
multiplying like the current library code does. We also take advantage
of computeKnownFPClass to avoid the scaling when we can statically
prove the input cannot be a denormal. We could do the same for the rcp
case, but unlike rsq a large input can underflow to denormal. We need
additional upper bound exponent checks on the input in order to do the
same for rcp.

This rsq handling also now starts handling the negated case. We
introduce rsq with an fneg. In the case the fneg doesn't fold into its
user, it's a neutral change but provides improvement if it is foldable
as a source modifier.

Also starts respecting the arcp attribute properly, and more strictly
interprets afn. We were previously interpreting afn as implying you
could do the reciprocal expansion of an fdiv. The codegen handling of
these also needs to be revisited.

This also effectively introduces the optimization
combineRepeatedFPDivisors enables, just done in the IR instead (and
only for f32).

This is almost across the board better. The one minor regression is
for gfx6/buggy frexp case where for multiple reciprocals, we could
previously reuse rematerialized constants per instance (it's neutral
for a single rcp).

The fdiv.fast and sqrt handling need to be revisited next.

https://reviews.llvm.org/D155593
2023-07-21 16:35:53 -04:00
Matt Arsenault
37512d7629 AMDGPU: Add baseline test for fdiv combine 2023-07-21 16:04:12 -04:00
Jay Foad
e45a0c2994 [AMDGPU][RFC] Update isLegalAddressingMode for GFX9 SMEM signed offsets
Differential Revision: https://reviews.llvm.org/D155587
2023-07-21 10:56:43 +01:00
Jay Foad
787bef0bee [AMDGPU] Add tests for SMEM addressing modes in CodeGenPrepare
Differential Revision: https://reviews.llvm.org/D155854
2023-07-21 10:56:43 +01:00
Matt Arsenault
d33ab05467 AMDGPU: Add flag to disable fdiv processing in IR pass
We kind of have to have multiple implementations of fdiv split between
the two selectors with some pre-processing. Add yet another test to
check for consistency of interpretation of flag combinations. We have
quite a bit of test redundancy here already, but there are so many
possible interesting permutations it's unwieldy to cover every detail
in any one of them. We have a number of overlapping fdiv tests but
it's hard to follow everything going on as it is.
2023-07-20 19:51:15 -04:00
Matt Arsenault
b2d58b596c AMDGPU: Expand rsq testing to cover contract flag
The 1.0/sqrt(x) -> rsq(x) fold increases precision and probably needs
a contract flag.
2023-07-20 19:51:15 -04:00
Matt Arsenault
fb54afd1b7 AMDGPU: Fold fsub [+-0] into fneg when folding source modifiers
This isn't always folded to fneg for a freestanding fsub depending on
the denormal mode. When matching source modifiers, we're implicitly
canonicalizing the input so we can fold it here.

Doesn't bother handling the VOP3P case since it's only relevant with
DAZ, which nobody really uses with f16.

For f64, tests show an existing bug where DAGCombiner tries to respect
the denormal mode for fsub -0, x, but not after it's lowered to fadd
-0, (fneg x). Either the fold is wrong or we shouldn't restrict the
fsub case based on the denormal mode.

https://reviews.llvm.org/D155652
2023-07-20 19:29:40 -04:00
Matt Arsenault
881e9f2934 AMDGPU: Regenerate test checks
Mostly a workaround for recent reverts in update_test_checks
2023-07-20 19:26:35 -04:00
Matt Arsenault
ca34f1bdcd AMDGPU: Add baseline test for folding fsub into fneg modifiers 2023-07-20 18:29:35 -04:00
Matt Arsenault
0295513238 AMDGPU: Filter out contract flags when lowering exp
It is unsafe to contract the fsub into the fmul. It also increases
code size by duplicating a constant.
2023-07-20 18:14:24 -04:00
Matt Arsenault
076bc374fc AMDGPU: Add some new baseline tests for exp lowering 2023-07-20 18:14:24 -04:00
Jingu Kang
351b4c17dd Revert "[MachineLICM] Handle Subloops"
This reverts commit 50dd383d08670960540fecb4b48c0f0429fbfba3.
2023-07-20 17:12:25 +01:00
Jingu Kang
50dd383d08 [MachineLICM] Handle Subloops
Following discussion on https://reviews.llvm.org/D154205, make MachineLICM pass
handle subloops with only visiting outmost loop's blocks once.

Differential Revision: https://reviews.llvm.org/D154205
2023-07-20 16:39:13 +01:00
Johannes Doerfert
d015018cb7 [AMDGPUAttributor][FIX] No endless recursion for recursive initializers
Fixes: https://github.com/llvm/llvm-project/issues/63956
2023-07-19 10:27:01 -07:00
Ivan Kosarev
1b32427213 [AMDGPU] Combine the SDAG and GISel versions of the fmed3.ll test.
Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D155590
2023-07-19 11:42:36 +01:00
Jay Foad
7fa7a08f21 [AMDGPU] Insert s_nop before s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
Differential Revision: https://reviews.llvm.org/D155681
2023-07-19 10:33:11 +01:00
Jingu Kang
62ed3ff4bb Revert "[MachineLICM] Handle Subloops"
This reverts commit 33e60484d750291e99301e29e60fe72c8fa48ccd.
2023-07-19 10:30:50 +01:00
Jay Foad
4389f9b2ad [AMDGPU] Regenerate is.fpclass checks 2023-07-19 09:32:22 +01:00
Matt Arsenault
c28e09c8d1 AMDGPU: Preserve flags in fdiv_fast lowering
We were dropping the flags and thus blocking contract into potential
fadd users. GlobalISel was already preserving the flags here.

https://reviews.llvm.org/D155443
2023-07-18 06:57:07 -04:00
Matt Arsenault
4a81283b94 AMDGPU: Generate and add fdiv tests
Prepare for new lowering strategies because we somehow didn't have
enough of them already.
2023-07-18 06:38:05 -04:00
Matt Arsenault
cdfdfe7ccc AMDGPU: Add some additional rcp/rsq tests 2023-07-18 06:37:15 -04:00
Matt Arsenault
3f8ef57bed MachineSink: Fix sinking VGPR def out of a divergent loop
This fixes sinking a VGPR def out of a loop past the reconvergence
point at the SI_END_CF. There was a prior fix which introduced
blockPrologueInterferes (D121277) to fix the same basic problem for
the post RA sink. This also had the special case isIgnorableUse case
which was incorrect, because in some contexts the exec use is not
ignorable.

I'm thinking about a new way to represent this which will avoid
needing hasIgnorableUse and isBasicBlockPrologue, which would function
more like the exception handling.

Fixes: SWDEV-407790

https://reviews.llvm.org/D155343
2023-07-18 06:15:50 -04:00
Matt Arsenault
d5ab379506 AMDGPU: Add baseline test for broken machine sinking 2023-07-18 06:15:50 -04:00
Konstantina Mitropoulou
4c42ab1199 [DAGCombiner] Change foldAndOrOfSETCC() to optimize and/or patterns
CMP(A,C)||CMP(B,C) => CMP(MIN/MAX(A,B), C)
CMP(A,C)&&CMP(B,C) => CMP(MIN/MAX(A,B), C)

This first patch handles integer types.

Reviewed By: craig.topper

Differential Revision: https://reviews.llvm.org/D153502
2023-07-17 17:13:47 -07:00
Konstantina Mitropoulou
11cd92a70f [NFC] Tests for future commit in DAGCombiner
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D153479
2023-07-17 17:08:32 -07:00
Matt Arsenault
04185f0b0b AMDGPU: Fix broken denormal constant folding of canonicalize
This needs to consider the dynamic denormal mode. It should be
possible to implement a runtime DAZ check with a canonicalize.
2023-07-17 19:54:20 -04:00