6243 Commits

Author SHA1 Message Date
Austin Kerbow
864a2b25be [AMDGPU] Reserve extra SGPR blocks with XNACK "any" TID Setting
ASMPrinter was relying on feature bits to set up extra SGPRs in the kernel
descriptor for the xnack_mask. This was broken for the dynamic XNACK "any" TID
setting which could cause user SGPRs to be clobbered if the number of SGPRs
reserved was near a granulated block boundary.

When XNACK was enabled this worked correctly in the ASMParser which meant some
kernels were only failing without "-save-temps".

Fixes: SWDEV-382764

Reviewed By: kzhuravl

Differential Revision: https://reviews.llvm.org/D145401
2023-03-17 20:26:23 -07:00
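To illustrate why the block boundary matters in the commit above, here is a minimal sketch of granulated SGPR accounting. The granule size of 8 and the cost of 2 SGPRs for the xnack_mask are assumptions made for illustration, and `sgpr_blocks` is a hypothetical helper, not an LLVM function.

```python
XNACK_MASK_SGPRS = 2  # assumption: the xnack_mask occupies two SGPRs


def sgpr_blocks(num_sgprs: int, granule: int = 8) -> int:
    """Round an SGPR count up to whole blocks of `granule` registers."""
    rounded = -(-max(1, num_sgprs) // granule) * granule  # ceiling to a multiple
    return rounded // granule


used = 15  # a count that sits just below a block boundary
without_xnack = sgpr_blocks(used)
with_xnack = sgpr_blocks(used + XNACK_MASK_SGPRS)

# If the descriptor is built from `without_xnack` while the code needs
# `with_xnack`, the reservation is one block short.
print(without_xnack, with_xnack)
```

Counts that are far from a boundary round to the same number of blocks either way, which is consistent with the message saying the clobbering only appeared near a granulated block boundary.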
Matt Arsenault
9356ec1516 CodeGen: Reorder case handling for is.fpclass legalization
Subnormal and zero checks can be combined into one, so move
the code closer to reduce the diff in a future change.
2023-03-17 11:29:50 -04:00
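One way to see why the subnormal and zero checks can share a single test: in IEEE 754 binary32 both classes have an all-zero exponent field. The sketch below shows that property only; it is an assumption about the reasoning, not the code from the patch.

```python
import struct

F32_EXP_MASK = 0x7F800000  # exponent field of an IEEE 754 binary32 value


def f32_bits(x: float) -> int:
    """Return the 32-bit pattern of x after rounding to binary32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def is_zero_or_subnormal(x: float) -> bool:
    """A single exponent test covers +-0 and every subnormal."""
    return (f32_bits(x) & F32_EXP_MASK) == 0


print(is_zero_or_subnormal(0.0), is_zero_or_subnormal(1e-40), is_zero_or_subnormal(1.0))
```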
Vitaly Buka
aa15fe98b6 Revert "[AMDGPUUnifyDivergentExitNodes] Add NewPM support"
Introduces nullptr dereference.

This reverts commit a5455e32b364dabe499ec11722626d4bbaf047ba.
2023-03-16 19:03:46 -07:00
Mirko Brkusanin
d5c0c1b6f0 [AMDGPU] Select flat atomic fmin/fmax
Also disables global atomic fmin/fmax x2 patterns on gfx11

Differential Revision: https://reviews.llvm.org/D146137
2023-03-16 18:07:26 +01:00
Anshil Gandhi
a5455e32b3 [AMDGPUUnifyDivergentExitNodes] Add NewPM support
Meanwhile, use UniformityAnalysis instead of LegacyDivergenceAnalysis to collect divergence info.

Reviewed By: arsenm, sameerds

Differential Revision: https://reviews.llvm.org/D141355
2023-03-16 16:13:29 +00:00
Nikita Popov
bbfb13a5ff [ConstExpr] Remove select constant expression
This removes the select constant expression, as part of
https://discourse.llvm.org/t/rfc-remove-most-constant-expressions/63179.
Uses of this expression have already been removed in advance,
so this just removes related infrastructure and updates tests.

Differential Revision: https://reviews.llvm.org/D145382
2023-03-16 10:32:08 +01:00
Konstantina Mitropoulou
6bc5aa592a [AMDGPU] Update mul.ll with auto-generated checks
Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D145990
2023-03-15 08:16:28 -07:00
pvanhout
723a53caaf [AMDGPU] Avoid constant bus limitation on V_BFE GISel pattern
For D141247 - if that pattern was used by GISel it could cause constant bus limitation failures.
Just use inline immediates instead of S_MOV to avoid the issue.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D146131
2023-03-15 15:01:33 +01:00
pvanhout
f90849dfa3 [AMDGPU] Use UniformityAnalysis in AtomicOptimizer
Adds & uses a new `isDivergentUse` API in UA.
UniformityAnalysis now requires CycleInfo as well, because the new temporal divergence API can query it.

-----

Original patch that adds `isDivergentUse` by @sameerds

The user of a temporally divergent value is marked as divergent in the
uniformity analysis. But the same user may also have been marked divergent for
other reasons, thus losing this information about temporal divergence. But some
clients need to specifically check for temporal divergence. This change restores
such an API, which already existed in DivergenceAnalysis.

Reviewed By: sameerds, foad

Differential Revision: https://reviews.llvm.org/D146018
2023-03-15 09:39:55 +01:00
pvanhout
64b45db34a [AMDGPU] Select v_sat_pk_u8_i16
The backend knew about `v_sat_pk_u8_i16` but never made use of it.
This patch adds selection patterns (DAG/GISel) for that instruction.
I think it'll be very rarely used, but at least it's possible to use it.

Solves #58266 (https://github.com/llvm/llvm-project/issues/58266)

Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D144729
2023-03-15 09:36:12 +01:00
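As a rough model of what `v_sat_pk_u8_i16` computes, the sketch below treats the source as two packed signed 16-bit lanes, saturates each to the unsigned 8-bit range, and packs the results into the low 16 bits. This reflects my reading of the instruction name; the helper names are hypothetical and the exact hardware semantics should be checked against the ISA documentation.

```python
def to_i16(bits: int) -> int:
    """Interpret the low 16 bits as a signed 16-bit integer."""
    bits &= 0xFFFF
    return bits - 0x10000 if bits & 0x8000 else bits


def sat_u8(value: int) -> int:
    """Clamp a signed value into the unsigned 8-bit range [0, 255]."""
    return max(0, min(255, value))


def v_sat_pk_u8_i16(src: int) -> int:
    """Saturate both i16 lanes of a 32-bit source and pack them as two u8."""
    lo = sat_u8(to_i16(src))
    hi = sat_u8(to_i16(src >> 16))
    return (hi << 8) | lo


# Low lane -5 clamps to 0, high lane 300 clamps to 255.
print(hex(v_sat_pk_u8_i16((300 << 16) | 0xFFFB)))
```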
Matt Arsenault
cd60bff329 CodeGen: Add some additional is_fpclass lowering tests
Cover more cases in preparation for making greater use
of fcmp based lowerings. Also add more tests for the inverted
cases. Test iszero | isnan test masks. We should probably just
generate every combination of test masks.
2023-03-15 01:13:08 -04:00
Simon Pilgrim
4bf004e07e [DAG] Fold (bitcast (logicop (bitcast x), (c))) -> (logicop x, (bitcast c)) iff the current logicop type is illegal
Try to remove extra bitcasts around logicops if we're dealing with illegal types

Fixes the regressions in D145939

Differential Revision: https://reviews.llvm.org/D146032
2023-03-14 14:41:11 +00:00
pvanhout
1f1fea6c38 Reland: [DAG/AMDGPU] Use UniformityAnalysis in DAGISel
Switch DAGISel over to UniformityAnalysis, which was one of the last remaining users of the DivergenceAnalysis.
No explosions seen during internal testing so this looks like a smooth transition.

Reviewed By: sameerds

Differential Revision: https://reviews.llvm.org/D145918
2023-03-14 14:38:45 +01:00
pvanhout
0ea6f0e158 [AMDGPU] Don't run llc-pipeline.ll when expensive_checks are enabled
AMDGPU ISel can add extra passes when expensive checks are enabled. This means the pipeline can be reordered and the checks may fail.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D146038
2023-03-14 14:12:36 +01:00
pvanhout
0e79106fc9 Revert "[DAG/AMDGPU] Use UniformityAnalysis in DAGISel"
This reverts commit 0022b5803fd4f5a4e9fcf233267c0ffa1b88f763.
2023-03-14 11:48:58 +01:00
pvanhout
0022b5803f [DAG/AMDGPU] Use UniformityAnalysis in DAGISel
Switch DAGISel over to UniformityAnalysis, which was one of the last remaining users of the DivergenceAnalysis.
No explosions seen during internal testing so this looks like a smooth transition.

Reviewed By: sameerds

Differential Revision: https://reviews.llvm.org/D145918
2023-03-14 11:18:28 +01:00
Chen Zheng
4f0ed16a46 Reland rGf35a09daebd0a90daa536432e62a2476f708150d and rG63854f91d3ee1056796a5ef27753648396cac6ec
[DAGCombiner] handle more store value forwarding

When lowering calls on targets like PPC, some stack loads
are generated for by-value parameters. The CALLSEQ_START node
prevents such loads from being combined.

Suggested by @RolandF, this patch removes the unnecessary
loads for the byval parameter by extending ForwardStoreValueToDirectLoad.

Reviewed By: nemanjai, RolandF

Differential Revision: https://reviews.llvm.org/D138899
2023-03-12 21:59:18 -04:00
Simon Pilgrim
f759275c1c [AMDGPU] Regenerate sdwa-peephole.ll 2023-03-12 13:50:25 +00:00
Jon Chesterfield
d3dda422bf [amdgpu][nfc] Replace ad hoc LDS frame recalculation with absolute_symbol MD
Post ISel, LDS variables are absolute values. Representing them as
such is simpler than the frame recalculation currently used to build assembler
tables from their addresses.

This is a precursor to lowering dynamic/external LDS accesses from non-kernel
functions.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D144221
2023-03-12 13:47:48 +00:00
Simon Pilgrim
b53ea2b9c5 [DAG] visitAND - fold (and (any_ext V), c) -> (zero_ext (and (trunc V), c)) if profitable.
Try to more aggressively narrow masks of extended values.

This is mainly for cases where the mask is trying to zero out any_extended upper bits, assuming we can zext/trunc the values for free.

This catches a few actual missed folds, as well as helps canonicalize a number of other cases which were being caught in isel etc.

Differential Revision: https://reviews.llvm.org/D145866
2023-03-12 13:25:23 +00:00
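The fold above can be checked on a simplified model in which V is already an 8-bit value and the mask c fits in those 8 bits. The `garbage` argument stands in for the unspecified upper bits that any_ext leaves behind; all names here are hypothetical.

```python
def any_ext_i8_to_i32(v8: int, garbage: int) -> int:
    """Model any_ext: low 8 bits come from v8, upper 24 bits are unspecified."""
    return ((garbage & 0xFFFFFF) << 8) | (v8 & 0xFF)


def and_of_any_ext(v8: int, garbage: int, c: int) -> int:
    """Original form: (and (any_ext V), c)."""
    return any_ext_i8_to_i32(v8, garbage) & c


def zext_of_narrow_and(v8: int, c: int) -> int:
    """Folded form: (zero_ext (and V, c)), valid here because c <= 0xFF."""
    return (v8 & 0xFF) & (c & 0xFF)  # zero extension adds only zero bits


print(and_of_any_ext(0xA5, 0xABCDEF, 0x0F) == zext_of_narrow_and(0xA5, 0x0F))
```

Because the mask clears every bit that any_ext left unspecified, the result no longer depends on `garbage`, which is what makes the narrower AND legal.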
Mirko Brkusanin
2eada459c7 [AMDGPU][MachineVerifier] Fix vdata reg count for MIMG d16
Differential Revision: https://reviews.llvm.org/D145785
2023-03-10 14:47:49 +01:00
Max Kazantsev
6b03ce374e [LICM] Simplify (X < A && X < B) into (X < MIN(A, B)) if MIN(A, B) is loop-invariant
We don't do this transform in InstCombine in the general case for arbitrary values, because the cost of
an AND and two ICMPs isn't higher than that of a MIN and an ICMP. However, LICM also knows
about the loop structure. This transform becomes profitable if `A` and `B` are loop-invariant and
`X` is not: by doing this, we can compute the min outside the loop.

Differential Revision: https://reviews.llvm.org/D143726
Reviewed By: nikic
2023-03-10 17:36:52 +07:00
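The LICM transform above can be sketched at source level as follows. This is only an illustration of the equivalence and of where the min ends up, not the pass itself; `original` and `hoisted` are hypothetical names.

```python
def original(xs, a, b):
    """Two comparisons per iteration: (x < a) and (x < b)."""
    return [x < a and x < b for x in xs]


def hoisted(xs, a, b):
    """min(a, b) is loop-invariant, so compute it once before the loop."""
    m = min(a, b)
    return [x < m for x in xs]


xs = list(range(-5, 6))
print(original(xs, 2, 4) == hoisted(xs, 2, 4))
```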
Max Kazantsev
279f0c02ad [Test] Regenerate tests using update_llc_test_checks.py 2023-03-10 11:34:16 +07:00
Valery Pykhtin
8f6c47b7a4 [AMDGPU] Speedup GCNDownwardRPTracker::advanceBeforeNext
The function runs liveness tests on the entire live register set for every instruction it passes.
This becomes very slow on high-RP regions such as ASAN-enabled code.

Instead, only the uses of the last tracked instruction should be tested, which greatly improves compilation time.

This patch revealed a few bugs in SIFormMemoryClauses and PreRARematStage::sinkTriviallyRematInsts which should
be fixed first.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D136267
2023-03-09 15:18:02 +01:00
Diana Picus
99d053a97d AMDGPU: Update checks for a couple of tests. NFC 2023-03-09 15:09:19 +01:00
Petar Avramovic
ded69779be Fix SGPR + VGPR + offset Scratch offset folding
Values in SGPR and VGPR registers are treated as unsigned by hardware.

When the value in a 32-bit SGPR or VGPR base can be negative, calculate the offset
using 32-bit add instructions; otherwise use
sgpr(unsigned) + vgpr(unsigned) + offset.

LoopStrengthReduce.cpp changes offsets to negative, and in some
iterations the value in an SGPR or VGPR register could be negative.

Differential Revision: https://reviews.llvm.org/D144957
2023-03-09 10:53:41 +01:00
Petar Avramovic
3ae310d0ae Fix VGPR + offset Scratch offset folding
Values in VGPR registers are treated as unsigned by hardware.

When the value in a 32-bit VGPR base can be negative, calculate the offset using
a 32-bit add instruction; otherwise use vgpr base(unsigned) + offset.
This does not affect the case where the whole offset comes from a VGPR register
(immediate offset is 0).

LoopStrengthReduce.cpp changes offsets to negative, and in some
iterations the value in a VGPR register could be negative.

Differential Revision: https://reviews.llvm.org/D144956
2023-03-09 10:52:44 +01:00
Petar Avramovic
5e56d59999 Fix SGPR + offset Scratch offset folding
Values in SGPR registers are treated as unsigned by hardware.

When the value in a 32-bit SGPR base can be negative, calculate the offset using
a 32-bit add instruction; otherwise use sgpr base(unsigned) + offset.
This does not affect the case where the whole offset comes from an SGPR register
(immediate offset is 0).

LoopStrengthReduce.cpp changes offsets to negative, and in some
iterations the value in an SGPR register could be negative.

Differential Revision: https://reviews.llvm.org/D144955
2023-03-09 10:52:44 +01:00
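The three scratch offset fixes above share one arithmetic point, sketched here. The model assumes the hardware zero-extends the 32-bit base and adds the offset without wrapping at 32 bits; that detail is an assumption for illustration, and both function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def add32(base: int, offset: int) -> int:
    """Explicit 32-bit add: the sum wraps, so a negative base behaves as intended."""
    return (base + offset) & MASK32


def hw_unsigned_add(base: int, offset: int) -> int:
    """Assumed hardware view: the base register is read as unsigned, no wrap."""
    return (base & MASK32) + offset


# A base of -16 with an offset of 32 should address byte 16.
print(hex(add32(-16, 32)), hex(hw_unsigned_add(-16, 32)))
# With a non-negative base both computations agree.
print(add32(100, 32) == hw_unsigned_add(100, 32))
```

Under this model, folding the offset is only safe when the base is known to be non-negative, which matches the condition described in the commit messages.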
Stanislav Mekhanoshin
e7ec123c6a [AMDGPU] Implement idempotent atomic lowering
This turns an idempotent atomic operation into an atomic load.

Fixes: SWDEV-385135

Differential Revision: https://reviews.llvm.org/D144759
2023-03-08 14:09:59 -08:00
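A small model of why an idempotent atomic operation can be replaced by an atomic load: the read-modify-write returns the old value and leaves memory unchanged. The set of operations shown is an assumption chosen for illustration; the log does not list which ones the patch recognizes.

```python
import operator

MASK32 = 0xFFFFFFFF
OPS = {
    "add": operator.add,
    "or": operator.or_,
    "xor": operator.xor,
    "and": operator.and_,
}


def atomic_rmw(mem: dict, addr: int, op: str, operand: int) -> int:
    """Model a 32-bit atomic read-modify-write; returns the old value."""
    old = mem[addr]
    mem[addr] = OPS[op](old, operand) & MASK32
    return old


def atomic_load(mem: dict, addr: int) -> int:
    """Model an atomic load."""
    return mem[addr]


mem = {0: 0x1234}
# add 0, or 0, xor 0 and and-with-all-ones leave the stored value untouched.
for op, operand in (("add", 0), ("or", 0), ("xor", 0), ("and", MASK32)):
    copy = dict(mem)
    print(op, atomic_rmw(copy, 0, op, operand) == atomic_load(mem, 0), copy == mem)
```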
Stanislav Mekhanoshin
59162e3859 [AMDGPU] Skip buffer_wbl2 before atomic fence acquire
Memory models for gfx90a and gfx940 do not require buffer_wbl2
before the fence for acquire ordering, but we do insert the full
release.

Fixes: SWDEV-386785

Differential Revision: https://reviews.llvm.org/D145524
2023-03-08 01:24:20 -08:00
Christudasan Devadasan
2171f04c12 [AMDGPU] Extend WorkGroupID* codegen for compute shaders
Currently, the codegen support for llvm.amdgcn.workgroup.id*
intrinsics is enabled only for compute kernels. In addition,
this patch enables their selection for compute shaders on
subtargets that have architected SGPRs.

Differential Revision: https://reviews.llvm.org/D145045
2023-03-08 07:36:19 +05:30
Florian Hahn
7019624ee1 [SCEV] Strengthen nowrap flags via ranges for ARs on construction.
At the moment, proveNoWrapViaConstantRanges is only used when creating
SCEV[Zero,Sign]ExtendExprs. We can get significant improvements by
strengthening flags after creating the AddRec.

I'll also share a follow-up patch that removes the code to strengthen
flags when creating SCEV[Zero,Sign]ExtendExprs. Modifying AddRecs while
creating those can lead to surprising changes.

Compile-time looks neutral:
https://llvm-compile-time-tracker.com/compare.php?from=94676cf8a13c511a9acfc24ed53c98964a87bde3&to=aced434e8b103109104882776824c4136c90030d&stat=instructions:u

Reviewed By: mkazantsev, nikic

Differential Revision: https://reviews.llvm.org/D144050
2023-03-07 17:10:34 +01:00
pvanhout
edca49cfb7 [AMDGPU] Match med3 for (max (min ..))
We previously only matched (min (max ...))

Depends on D144728

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D145159
2023-03-07 11:14:31 +01:00
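The two clamp shapes named in the commit above compute the same median when the bounds are ordered. The sketch below checks that for integers, assuming lo <= hi; the function names are hypothetical and the actual matching conditions in the backend may be stricter.

```python
def med3(a: int, b: int, c: int) -> int:
    """Median of three values."""
    return sorted((a, b, c))[1]


def min_of_max(x: int, lo: int, hi: int) -> int:
    """The previously matched shape: (min (max x, lo), hi)."""
    return min(max(x, lo), hi)


def max_of_min(x: int, lo: int, hi: int) -> int:
    """The newly matched shape: (max (min x, hi), lo)."""
    return max(min(x, hi), lo)


print(min_of_max(7, 0, 5), max_of_min(7, 0, 5), med3(7, 0, 5))
```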
pvanhout
764e39048c [AMDGPU] Precommit test: v_sat_pk_u8_i16.ll
Differential Revision: https://reviews.llvm.org/D144728
2023-03-07 11:07:13 +01:00
Jay Foad
5281f5c1e6 [AMDGPU] Add GFX9,GFX10,GFX11 checks for llvm.amdgcn.s.buffer.load 2023-03-06 18:19:50 +00:00
Jay Foad
e73d3150b1 [AMDGPU] Generate checks for llvm.amdgcn.s.buffer.load 2023-03-06 18:19:50 +00:00
pvanhout
036431e31e [AMDGPU] Use UniformityAnalysis in LateCodeGenPrepare
Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D145366
2023-03-06 13:35:57 +01:00
pvanhout
dbebebf6f6 [AMDGPU] Use UniformityAnalysis in CodeGenPrepare
A little extra change was needed in UA because it didn't consider
InvokeInst and it made call-constexpr.ll assert.

Reviewed By: sameerds, arsenm

Differential Revision: https://reviews.llvm.org/D145358
2023-03-06 13:26:51 +01:00
Jay Foad
271010bf50 [AMDGPU] Restore temporal divergence in test
The loop in this test was supposed to have temporal divergence but this
was broken by r367221. Fix it.
2023-03-06 12:09:52 +00:00
pvanhout
7a5d850da2 [AMDGPU] Use UniformityAnalysis in RewriteUndefsForPHI
Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D145359
2023-03-06 12:15:33 +01:00
Matt Arsenault
9f4746b65f AMDGPU: Combine down fcopysign f64 magnitude
Copy through the low bits and only apply an f32
copysign to the high half. This is effectively
what we do for codegen anyway, but this provides
some combine benefits. The cases involving constants
show some small improvements.

https://reviews.llvm.org/D142682
2023-03-06 05:54:25 -04:00
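The idea of copying the low bits through and applying the sign only to the high half can be modeled on raw bit patterns. This sketch operates on host doubles and is not the DAG combine itself; the function name is hypothetical.

```python
import math
import struct


def f64_bits(x: float) -> int:
    """Return the 64-bit pattern of a double."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def copysign_via_high_half(mag: float, sign: float) -> float:
    """Copy the low 32 bits unchanged and fix the sign bit in the high 32 bits."""
    mag_bits = f64_bits(mag)
    lo = mag_bits & 0xFFFFFFFF                   # passed through untouched
    hi = (mag_bits >> 32) & 0x7FFFFFFF           # magnitude bits of the high half
    hi |= (f64_bits(sign) >> 32) & 0x80000000    # only the sign bit of `sign` matters
    return struct.unpack("<d", struct.pack("<Q", (hi << 32) | lo))[0]


print(copysign_via_high_half(1.5, -2.0) == math.copysign(1.5, -2.0))
```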
Matt Arsenault
606a62ce27 AMDGPU: Force sign operand of f64 fcopysign to f32
The fcopysign DAG operation, unlike the IR one, allows
different types for the sign and magnitude. We can reduce
the bitwidth of the high operand since only the sign bit matters.

The default combine only introduces mixed fcopysign
operand types from fpext/fptrunc. We effectively do this
already during selection, but doing it earlier in the combiner
should expose new combine opportunities (e.g. the existing tests
now eliminate the load of the low half of the double). Unfortunately
this isn't enough to handle the case I'm interested in just yet.
2023-03-05 19:54:13 -04:00
Matt Arsenault
bd1f7c417f AMDGPU: Try to push fneg as integer into select
I initially attempted to select the source modifier from xor of
a sign mask. This proved to be more difficult since
foldBinOpIntoSelect does not consider free fneg of integers
and undoes the combine.
2023-03-05 18:53:16 -04:00
Jeffrey Byrnes
b89236a96f [AMDGPU] Vectorize misaligned global loads & stores
Based on experimentation on gfx906,908,90a and 1030, wider global loads / stores are more performant than multiple narrower ones independent of alignment -- this is especially true when combining 8 bit loads / stores, in which case speedup was usually 2x across all alignments.

Differential Revision: https://reviews.llvm.org/D145170

Change-Id: I6ee6c76e6ace7fc373cc1b2aac3818fc1425a0c1
2023-03-03 13:18:25 -08:00
Jay Foad
7442f8635b [AMDGPU] Fix invalid instid value in s_delay_alu instruction
Differential Revision: https://reviews.llvm.org/D145232
2023-03-03 21:08:26 +00:00
Jay Foad
08bdff862c [AMDGPU] Fix error message for illegal copy 2023-03-03 11:46:01 +00:00
Jay Foad
f5ab447cf6 [AMDGPU] Add test case for AMDGPUInsertDelayAlu bug 2023-03-03 11:08:39 +00:00
Petar Avramovic
c77bd1fe15 AMDGPU: Add more flat scratch load and store tests for 8 and 16-bit types
Add tests for more complicated scratch load and store patterns.
Includes:
- sign and zero extending loads of i8 and i16 to i32 into 32-bit register
- D16 instructions that affect only high or low 16 bits of 32-bit register
 - D16 sign and zero extending loads of i8 to i16 into high or low 16 bits
   of 32-bit register
 - D16 loads of i16 to high or low 16 bits of 32-bit register
 - D16 stores of i8 and i16 from high 16 bits of 32-bit register

Differential Revision: https://reviews.llvm.org/D145081
2023-03-02 13:20:14 +01:00
Anshil Gandhi
7474cd3e2e [SIAnnotateControlFlow] Use Uniformity analysis
Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D145013
2023-03-01 10:19:45 -07:00
Anshil Gandhi
1b52c7be91 [AMDGPUUnifyDivergentExitNodes] Use Uniformity Analysis
Reviewed By: foad

Differential Revision: https://reviews.llvm.org/D145018
2023-03-01 10:17:11 -07:00