115 Commits

Author SHA1 Message Date
Piotr Sobczak
fac093dd08
[AMDGPU] Update IEEE and DX10_CLAMP for GFX12 (#75030)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2023-12-13 13:52:40 +01:00
Pierre van Houtryve
8a66510fa7
[AMDGPU] Don't create mulhi_24 in CGP (#72983)
Instead, create a mul24 with a 64 bit result and let ISel take care of
it.

This allows patterns to simply match mul24 even for 64-bit muls instead of having to match both mul/mulhi and a buildvector/bitconvert/etc.
2023-11-30 08:26:45 +01:00
Matt Arsenault
231aa0f212 AMDGPU: Avoid creating vector extracts if we aren't going to do anything
Try to avoid expensive checks failures from reporting no changes
when some dead instructions were introduced.
2023-09-13 09:45:34 +03:00
Matt Arsenault
72a7024add AMDGPU: Correctly lower llvm.sqrt.f32
Make codegen emit correctly rounded sqrt by default.

Emit the fast but only kind of fast expansion in AMDGPUCodeGenPrepare
based on !fpmath, like the fdiv case. Hack around visitation ordering
problems from AMDGPUCodeGenPrepare using forward iteration instead of
a well behaved combiner.

https://reviews.llvm.org/D158129
2023-09-12 23:22:54 +03:00
Matt Arsenault
6012fed6f5 AMDGPU: Fix sqrt fast math flags spreading to fdiv fast math flags
This was working around the lack of operator| on FastMathFlags. We
have that now which revealed the bug.
2023-08-30 11:53:05 -04:00
Matt Arsenault
a738bdf35e AMDGPU: Permit more rsq formation in AMDGPUCodeGenPrepare
We were basing the defer the fast case to codegen based on the fdiv
itself, and not looking for a foldable sqrt input.

https://reviews.llvm.org/D158127
2023-08-23 20:06:50 -04:00
pvanhout
b7503ae8a0 [AMDGPU] Clear BreakPhiNodesCache in-between functions
Otherwise stale pointers pollute the cache and
when a dead PHI's memory is reused for another PHI, we can get a false positive hit in the cache.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D157711
2023-08-11 15:23:41 +02:00
pvanhout
62ea799e6c [AMDGPU] Break Large PHIs: Take whole PHI chains into account
Previous heuristics had a big flaw: they only looked at single PHI at a time, and didn't take into account the whole "chain".
The concept of "chain" is important because if we only break a chain partially, we risk forcing regalloc to reserve twice as many registers for that vector.
We also risk adding a lot of copies that shouldn't be there and can inhibit backend optimizations.

The solution I found is to consider the whole "PHI chain" when looking at PHI.
That is, we recursively look at the PHI's incoming value & users for other PHIs, then make a decision about the chain as a whole.

The currrent threshold requires that at least `ceil(chain size * (2/3))` PHIs have at least one interesting incoming value.
In simple terms, two-thirds (rounded up) of the PHIs should be breakable.

This seems to work well. A lower threshold such as 50% is too aggressive because chains can often have 7 or 9 PHIs, and breaking 3+ or 4+ PHIs in those case often causes performance issue.

Fixes SWDEV-409648, SWDEV-398393, SWDEV-413487

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D156414
2023-08-03 16:41:11 +02:00
Kazu Hirata
03612b2c1a [AMDGPU] Fix an unused variable warning
This patch fixes:

  llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp:1006:9: error:
  unused variable 'Ty' [-Werror,-Wunused-variable]
2023-07-21 16:14:41 -07:00
Matt Arsenault
6398b687c5 AMDGPU: Fix variables only used in asserts 2023-07-21 18:55:42 -04:00
Matt Arsenault
8406c3568a AMDGPU: Implement new 2ulp fdiv lowering
Extends the new frexp scaled reciprocal to the general case. The
reciprocal case is just the same thing when frexp of 1 is constant
folded. Could probably clean up the code to rely on that constant
folding.

Improves results for the IEEE path for the default OpenCL division. We
used to only emit the fdiv.fast intrinsic with a 2.5 ulp accuracy
threshold with DAZ, which uses explicit range checks. This gives us a
better fast option with the default IEEE behavior.
2023-07-21 18:55:42 -04:00
Matt Arsenault
6699c37028 AMDGPU: Refactor AMDGPUCodeGenPrepare fdiv handling
NFC-ish. Does trigger some reordering of the fdiv scalarization. Also
skips scalarizing in more cases where nothing was going to happen. We
can still scalarize in some no-op edge cases.

https://reviews.llvm.org/D155740
2023-07-21 18:55:42 -04:00
Matt Arsenault
8287f3af9d AMDGPU: Overhaul and improve rcp and rsq f32 formation
The highlight change is a new denormal safe 1ulp lowering which uses
rcp after using frexp to perform input scaling. This saves 2
instructions compared to other implementations which performed an
explicit denormal range change. This improves the OpenCL default, and
requires a flag for HIP. I don't believe there's any flag wired up for
OpenMP to emit the necessary fpmath metadata.

This provides several improvements and changes that were hard to
separate without regressing one case or another. Disturbingly the
OpenCL conformance test seems to have the reciprocal test commented
out. I locally hacked it back in to test this.

Starts introducing f32 rsq intrinsics in AMDGPUCodeGenPrepare. Like
the rcp case, we could do this in codegen if !fpmath were preserved
(although we would lose some computeKnownFPClass tricks). Start
requiring contract flags to form rsq. The rsq fusion actually improves
the result from ~2ulp to ~1ulp. We have some older fusion in codegen
which only keys off unsafe math which should be refined.

Expand rsq patterns by checking for denormal inputs and pre/post
multiplying like the current library code does. We also take advantage
of computeKnownFPClass to avoid the scaling when we can statically
prove the input cannot be a denormal. We could do the same for the rcp
case, but unlike rsq a large input can underflow to denormal. We need
additional upper bound exponent checks on the input in order to do the
same for rcp.

This rsq handling also now starts handling the negated case. We
introduce rsq with an fneg. In the case the fneg doesn't fold into its
user, it's a neutral change but provides improvement if it is foldable
as a source modifier.

Also starts respecting the arcp attribute properly, and more strictly
interprets afn. We were previously interpreting afn as implying you
could do the reciprocal expansion of an fdiv. The codegen handling of
these also needs to be revisited.

This also effectively introduces the optimization
combineRepeatedFPDivisors enables, just done in the IR instead (and
only for f32).

This is almost across the board better. The one minor regression is
for gfx6/buggy frexp case where for multiple reciprocals, we could
previously reuse rematerialized constants per instance (it's neutral
for a single rcp).

The fdiv.fast and sqrt handling need to be revisited next.

https://reviews.llvm.org/D155593
2023-07-21 16:35:53 -04:00
Matt Arsenault
d33ab05467 AMDGPU: Add flag to disable fdiv processing in IR pass
We kind of have to have multiple implementations of fdiv split between
the two selectors with some pre-processing. Add yet another test to
check for consistency of interpretation of flag combinations. We have
quite a bit of test redundancy here already, but there are so many
possible interesting permutations it's unwieldy to cover every detail
in any one of them. We have a number of overlapping fdiv tests but
it's hard to follow everything going on as it is.
2023-07-20 19:51:15 -04:00
pvanhout
e5296c52e5 [AMDGPU] Relax restrictions on unbreakable PHI users in BreakLargePHis
The previous heuristic rejected a PHI if one of its user was an unbreakable PHI, no matter what the other users were.

This worked well in most cases, but there's one case in rocRAND where
it doesn't work. In that case, a PHI node has 2 PHI users where one is
breakable but not the other. When that PHI node isn't broken performance falls by 35%.

Relaxing the restriction to "require that  half of the PHI node users are breakable" fixes the issue, and seems like a sensible change.

Solves SWDEV-409648, SWDEV-398393

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D155184
2023-07-14 09:02:51 +02:00
Matt Arsenault
fbe4ff8149 AMDGPU: Partially fix not respecting dynamic denormal mode
The most notable issue was producing v_mad_f32 in functions with the
dynamic mode, since it just ignores the mode. fdiv lowering is still
somewhat broken because it involves a mode switch and we need to query
the original mode.
2023-07-11 15:14:52 -04:00
Matt Arsenault
64d325454b AMDGPU: Delete custom combine on class intrinsic
This is no longer necessary as class-with-constant will always be
transformed to the generic class intrinsic.

https://reviews.llvm.org/D153901
2023-07-07 15:28:21 -04:00
Matt Arsenault
9c82dc6a6b AMDGPU: Always use v_rcp_f16 and v_rsq_f16
These inherited the fast math checks from f32, but the manual suggests
these should be accurate enough for unconditional use. The definition
of correctly rounded is 0.5ulp, but the manual says "0.51ulp". I've
been a bit nervous about changing this as the OpenCL conformance test
does not cover half. Brute force produces identical values compared to
a reference host implementation for all values.
2023-07-05 16:53:01 -04:00
Anshil Gandhi
a22ef958cb [AMDGPUCodegenPrepare] Add NewPM Support
Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D151241
2023-05-26 00:20:01 -06:00
pvanhout
fa87dd52d4 [AMDGPU] Handle multiple occurences of an incoming value in break large PHIs
We naively broke all incoming values, assuming they'd be unique.
However it's not illegal to have multiple occurences of, e.g. `[BB0, V0]`
in a PHI node. What's illegal though is having the same basic block
multiple times but with different values, and it's exactly what the
transform caused. This broke in some rare applications where the pattern
arised.

Now we cache the `BasicBlock, Value` pairs we're breaking so we can reuse the values and preserve this invariant.

Solves SWDEV-399460

Reviewed By: #amdgpu, rovka

Differential Revision: https://reviews.llvm.org/D151069
2023-05-22 13:40:26 +02:00
Matt Arsenault
0d0ed9a355 AMDGPU: Pattern match fract instructions in AMDGPUCodeGenPrepare
This will allow eliminating the intrinsic uses in the device
libraries, which will remove a subtarget dependency on the f16
version of the intrinsic.

We previously had some wrong patterns for this under unsafe math
which I've removed.

Do it in IR partially to take advantage of the much better isKnownNeverNaN
handling, and partially out of laziness to avoid repeating this in the DAG
and GlobalISel path. Plus I think this should be done much earlier. Ideally
this would be in InstCombine, but you can't introduce target intrinsics
from a generic instruction rooted pattern.
2023-05-18 23:29:47 +01:00
pvanhout
52a2d07bb3 [AMDGPU] Improve PHI-breaking heuristics in CGP
D147786 made the transform more conservative by adding heuristics,
which was a good idea. However, the transform got a bit
too conservative at times.

This caused a surprise in some rocRAND benchmarks because D143731 greatly helped a few of them.
For instance, a few xorwow-uniform tests saw a +30% boost in performance after that pass, which was lost when D147786 landed.

This patch is an attempt at reaching a middleground that makes
the pass a bit more permissive. It continues in the same spirit as
D147786 but does the following changes:
- PHI users of a PHI node are now recursively checked. When loops are encountered, we consider the PHIs non-breakable. (Considering them breakable had very negative effect in one app I tested)
-  `shufflevector` is now considered interesting, given that it satisfies a few trivial checks.

Reviewed By: arsenm, #amdgpu, jmmartinez

Differential Revision: https://reviews.llvm.org/D150266
2023-05-15 09:16:22 +02:00
Matt Arsenault
6a0d0711ce AMDGPU: Don't try to create pointer bitcasts in load widening 2023-04-29 10:04:33 -04:00
pvanhout
b3b3cb2d2f [AMDGPU] Less aggressively break large PHIs
In some cases, breaking large PHIs can very negatively affect
performance (3x more instructions observed in a particular test case).

This patch adds some basic profitability heuristics to help with some of these issues without affecting the "good" cases.
e.g. avoid breaking PHIs if it causes back-and-forth between vector/scalar form for no good reason.

Fixes SWDEV-392803
Fixes SWDEV-393781
Fixes SWDEV-394228

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D147786
2023-04-14 15:41:26 +02:00
pvanhout
d892521076 [AMDGPU] Break-up large PHIs for DAGISel
DAGISel uses CopyToReg/CopyFromReg to lower PHI nodes. With large PHIs, this can result in poor codegen.
This is because it introduces a need to have a build_vector before copying the PHI value, and that build_vector may have many undef elements. This can cause very high register pressure and abnormal stack usage in some cases.

This scalarization/phi "break-up" can be easily tuned/disabled through CL options in case it's not beneficial for some users.
It's also only enabled for DAGIsel and GlobalISel handles PHIs much better (as it works on the whole function).

This can both scalarize (break a vector into its elements) and simplify (break a vector into smaller, more manageable subvectors) PHIs.

Fixes SWDEV-321581

Reviewed By: kzhuravl

Differential Revision: https://reviews.llvm.org/D143731
2023-03-28 09:38:47 +02:00
pvanhout
dbebebf6f6 [AMDGPU] Use UniformityAnalysis in CodeGenPrepare
A little extra change was needed in UA because it didn't consider
InvokeInst and it made call-constexpr.ll assert.

Reviewed By: sameerds, arsenm

Differential Revision: https://reviews.llvm.org/D145358
2023-03-06 13:26:51 +01:00
Jay Foad
dcb834843e [AMDGPU] Split SIModeRegisterDefaults out of AMDGPUBaseInfo. NFC.
This is only used by CodeGen. Moving it out of AMDGPUBaseInfo simplifies
future changes to make some of it depend on the subtarget.

Differential Revision: https://reviews.llvm.org/D144650
2023-02-23 16:38:15 +00:00
Kazu Hirata
64dad4ba9a Use llvm::bit_cast (NFC) 2023-02-14 01:22:12 -08:00
Jay Foad
6443c0ee02 [AMDGPU] Stop using make_pair and make_tuple. NFC.
C++17 allows us to call constructors pair and tuple instead of helper
functions make_pair and make_tuple.

Differential Revision: https://reviews.llvm.org/D139828
2022-12-14 13:22:26 +00:00
Matt Arsenault
3830e4e58c AMDGPU: Create poison values instead of undef
These placeholders don't care about the finer points on
the difference between the two.
2022-11-16 14:47:24 -08:00
Matt Arsenault
838fd611b7 AMDGPU: Fix assertion on <1 x i16> vectors
Fixes issue 58331.
2022-10-12 17:25:24 -07:00
Nikita Popov
8e70258b18 [AMDGPUCodeGenPrepare] Check result of ConstantFoldBinaryOpOperands()
This function will become fallible once we don't support constant
expressions for all binops, so make sure to check the result.
2022-07-04 14:20:23 +02:00
Sebastian Neubauer
6527b2a4d5 [AMDGPU][NFC] Fix typos
Fix some typos in the amdgpu backend.

Differential Revision: https://reviews.llvm.org/D119235
2022-02-18 15:05:21 +01:00
Craig Topper
cbcbbd6ac8 [ValueTracking][SelectionDAG] Rename ComputeMinSignedBits->ComputeMaxSignificantBits. NFC
This function returns an upper bound on the number of bits needed
to represent the signed value. Use "Max" to match similar functions
in KnownBits like countMaxActiveBits.

Rename APInt::getMinSignedBits->getSignificantBits. Keeping the old
name around to keep this patch size down. Will do a bulk rename as
follow up.

Rename KnownBits::countMaxSignedBits->countMaxSignificantBits.

Reviewed By: lebedev.ri, RKSimon, spatel

Differential Revision: https://reviews.llvm.org/D116522
2022-01-03 11:33:30 -08:00
Craig Topper
361216f3c4 [AMDGPU] Use ComputeMinSignedBits and KnownBits::countMaxActiveBits to simplify some code. NFC
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D116516
2022-01-03 10:09:51 -08:00
Jay Foad
21a1d4cf71 [AMDGPU] Change numBitsSigned for simplicity and document it. NFC.
Change numBitsSigned to return the minimum size of a signed integer that
can hold the value. This is different by one from the previous result
but is more consistent with numBitsUnsigned. Update all callers. All
callers are now more consistent between the signed and unsigned cases,
and some callers get simpler, especially the ones that deal with
quantities like numBitsSigned(LHS) + numBitsSigned(RHS).

Differential Revision: https://reviews.llvm.org/D112813
2021-10-29 14:22:06 +01:00
Abinav Puthan Purayil
781dd39b7b [AMDGPU] Enable 48-bit mul in AMDGPUCodeGenPrepare.
We were bailing out of creating 24-bit muls for results wider than 32
bits in AMDGPUCodeGenPrepare. With the 24-bit mulhi intrinsic, this
change teaches AMDGPUCodeGenPrepare to generate the 48-bit mul
correctly.

Differential Revision: https://reviews.llvm.org/D112395
2021-10-26 18:53:07 +05:30
Abinav Puthan Purayil
de3038400b [AMDGPU] Avoid redundant calls to numBits in AMDGPUCodeGenPrepare::replaceMulWithMul24().
The isU24() and isI24() calls numBits to make its decision. This change
replaces them with the internal numBits call so that we can use its
result for the > 32 bit width cases.

Differential Revision: https://reviews.llvm.org/D111864
2021-10-15 19:49:44 +05:30
Abinav Puthan Purayil
0379263f23 [AMDGPU] Fix width check for signed mul24 generation.
This changes fixes a case in which the highest set bit of the original
result is at bit 31 and sign-extending the mul24 for it would make the
result negative.

Differential Revision: https://reviews.llvm.org/D111823
2021-10-15 18:53:41 +05:30
Abinav Puthan Purayil
b3c9d84e5a [AMDGPU] Fix 24-bit mul intrinsic generation for > 32-bit result.
The 24-bit mul intrinsics yields the low-order 32 bits. We should only
do the transformation if the operands are known to be not wider than 24
bits and the result is known to be not wider than 32 bits.

Differential Revision: https://reviews.llvm.org/D111523
2021-10-14 09:00:19 +05:30
Jacob Lambert
dc6e8dfdfe [AMDGPU][NFC] Correct typos in lib/Target/AMDGPU/AMDGPU*.cpp files. Test commit for new contributor. 2021-09-20 14:48:50 -07:00
Jay Foad
477b9bc9f7 [AMDGPU] Minor cleanup after D109483. NFC. 2021-09-13 10:27:15 +01:00
Anshil Gandhi
2e5dc4a1ef [AMDGPU] [CodeGen] Fold negate llvm.amdgcn.class into test mask
Implemented the transformation of xor (llvm.amdgcn.class x, mask), -1 into
llvm.amdgcn.class(x, ~mask). Added LIT tests as well.

Differential Revision: https://reviews.llvm.org/D104049
2021-06-18 13:04:12 -06:00
Nikita Popov
9914200393 [CodeGen] Add missing includes (NFC)
These currently rely on the IRBuilder.h include in TargetLowering.h.
Make them explicit.
2021-06-06 15:48:27 +02:00
Serge Guelton
d6de1e1a71 Normalize interaction with boolean attributes
Such attributes can either be unset, or set to "true" or "false" (as string).
throughout the codebase, this led to inelegant checks ranging from

        if (Fn->getFnAttribute("no-jump-tables").getValueAsString() == "true")

to

        if (Fn->hasAttribute("no-jump-tables") && Fn->getFnAttribute("no-jump-tables").getValueAsString() == "true")

Introduce a getValueAsBool that normalize the check, with the following
behavior:

no attributes or attribute set to "false" => return false
attribute set to "true" => return true

Differential Revision: https://reviews.llvm.org/D99299
2021-04-17 08:17:33 +02:00
Matt Arsenault
2a0db8d70e AMDGPU: Use more accurate fast f64 fdiv
A raw v_rcp_f64 isn't accurate enough, so start applying correction.
2021-01-21 10:51:36 -05:00
dfukalov
560d7e0411 [NFC][AMDGPU] Split AMDGPUSubtarget.h to R600 and GCN subtargets
... to reduce headers dependency.

Reviewed By: rampitec, arsenm

Differential Revision: https://reviews.llvm.org/D95036
2021-01-20 22:22:45 +03:00
dfukalov
6a87e9b08b [NFC][AMDGPU] Reduce include files dependency.
Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D93813
2021-01-07 22:22:05 +03:00
Simon Pilgrim
1673a08044 SelectionDAG.h - remove unnecessary FunctionLoweringInfo.h include. NFCI.
Use forward declarations and move the include down to dependent files that actually use it.

This also exposes a number of implicit dependencies on KnownBits.h
2020-09-03 18:33:25 +01:00
Matt Arsenault
75e6f0b3d4 AMDGPU: Add flag to disable promotion of uniform i16 ops
This interferes with GlobalISel's much better handling of the
situation.

This should really be disable for GlobalISel. However, the fallback
only re-runs the selection passes, and doesn't go back and rerun any
codegen IR passes. I haven't come up with a good solution to this
problem.
2020-08-24 14:39:27 -04:00